Okay, hitting record. Alright, welcome back everybody. Today is day 3 of lectures for the May 2023 Erdős Institute boot camp. Today we're going to start learning about supervised learning, and in particular dive into regression. Remember, you can find the lectures in the repository if you have the GitHub repo cloned. Once you open up your Jupyter environment, we're going to go to lectures, and we're going to start in the supervised learning folder.

In there we're going to skip the introduction notebook; that's just a notebook that lays out what we're going to talk about. We'll dive straight into the supervised learning framework notebook first, then after that we'll look at data splits, and then after that we'll dive into regression.

In this notebook I just want to lay the groundwork for what supervised learning is, and set up a common framework that any supervised learning problem can be placed into. From there we'll branch out and actually start to learn algorithms.

So, supervised learning. The idea here is that you have data, X and y. X is a collection of features: think of these as inputs, data you have about your observations, that you think you can use to predict the output, which we'll store in a vector called y. y can be continuous, it can be categorical, it can be binary; it's something we'd like to predict using the data stored in a matrix X, an n by m matrix: n rows, which is the number of observations, and m columns, which is the number of features. This will become more clear as we dive into the algorithms.

The framework for supervised learning is that we assume (it may not be true, but we're going to assume) that the output we're interested in is equal to a function of the inputs plus some random noise. So f is a function from the m-dimensional reals, all of those features, down to the real numbers, and it's what we're trying to estimate. The idea being, once we estimate f, if we do a good enough job, we can predict what various values of y would be given the inputs. f(X) is also known as the systematic information that X gives about y, or it's sometimes referred to as the signal that X is providing about y. And then epsilon is random noise that we think of as independent from X; that's one of our assumptions, that the random noise is independent of the observation.
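To put the framework just described in symbols (a light formalization of the spoken description, using the lecture's n-observations-by-m-features convention):

```latex
% X \in \mathbb{R}^{n \times m}: n observations (rows) of m features (columns)
% y \in \mathbb{R}^{n}: the outputs we want to predict
y_i = f(x_i) + \varepsilon_i, \qquad f : \mathbb{R}^{m} \to \mathbb{R},
\qquad \varepsilon_i \text{ independent of } x_i .
```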
The shape of that random noise, its distribution, depends on the problem you're working on. I think it's easiest to understand this with an example. We're going to assume that X is a single one-dimensional vector, a column vector, and that the relationship is simply linear: y equals x plus random noise, where the random noise is normally distributed.

You're going to see some code here that might not make sense right now; we're going to dive into it more deeply later. It's mostly for making graphs and illustration purposes, and hopefully it will become more clear when we dive into specific algorithms. So for now don't worry about understanding the code you're seeing, just worry about the concepts.

Okay, so basically what we're saying is that in the real world there's this y that is equal to f(x); this is the systematic information. Somewhere out there, there's a variable y and a variable x, and in nature they are related in this way, ignoring any sort of random perturbation. The idea here is that we don't know what this relationship is ahead of time. But what we can do is collect data. So we go out into the world and, let's say, we collect a hundred observations; in the notebook this is simulated with random draws. Graphically, what we're thinking of is that in the background there's this black line, which is the true relationship, and the observations represent the random deviations from the true relationship that will always occur in nature.

These blue dots are the things that we have, and we'd like to use these blue dots to make an estimate of the black line, the true relationship. Typically we'll use some sort of algorithm for this; in this particular notebook, that algorithm is linear regression. So we use that to make an estimate. Basically, what we're saying is: using these blue dots, we made this estimate, which is the red solid line, and our hope is that the red solid line, our estimate, is a good approximation of the real-world relationship represented by the black line. Here "good" tends to mean that the estimate is close, in some sort of distance metric, to the actual relationship; I know that doesn't sound very definitive. And so this is the process of supervised learning: you assume that there is some true relationship, you go out and observe some data to estimate that relationship, and the hope is that the estimate is close to the true relationship.
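A minimal sketch of the kind of simulation the notebook is running here, assuming the true relationship y = x from the example; how x is sampled, the noise level, the seed, and the plotting details are illustrative rather than the notebook's actual code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(216)

# "Nature": the true relationship is y = x, plus normally distributed noise
x = rng.uniform(0, 1, 100)            # 100 observations of a single feature
y = x + rng.normal(0, 0.2, 100)       # noise level is illustrative

# Estimate the relationship from the observations (blue dots -> red line)
slr = LinearRegression()
slr.fit(x.reshape(-1, 1), y)          # sklearn expects a 2D feature array

xs = np.linspace(0, 1, 50)
plt.scatter(x, y, label="observations")
plt.plot(xs, xs, "k", label="true relationship")
plt.plot(xs, slr.predict(xs.reshape(-1, 1)), "r", label="estimate")
plt.legend()
plt.show()
```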
Okay, so Ashley's asking: how did I get the true relationship? This was just a hypothetical; this is not real-world data. In the real world, with real data, you're basically never going to know the real relationship, so you're not going to have the ability to graph this true-relationship line. But in this imaginary world, where I'm demonstrating how supervised learning works, I can pretend that I know it ahead of time, just to demonstrate the process of supervised learning.

So in supervised learning there are two main modeling goals: making predictions and making inferences. Making predictions means that you want to produce your estimate, your algorithm, so that the predictions you make are as close as possible to the real-world observations you're going to see. Here you're maybe less concerned with trying to understand the true relationship between things, and more concerned with producing models that make really good predictions. We'll talk about what it means to be a good prediction as we dive into the course.

The other goal of supervised learning techniques is to make inferences. The difference between a prediction and an inference is that with an inference you're really trying to understand the relationship itself, and then describe it in a way that is useful: describe how changes in X impact values of y. Here the best estimate is typically the model that explains the variance in y while still being parsimonious. This is more of a statistics point of view; basically, you're trying to explain the y given the X, whereas with predictions you don't actually care whether you're able to explain how different values of X impact values of y, you just care about making good predictions.

Sometimes these turn out to be the same model: the model that makes the best predictions is also the model that allows you to most easily make inferences. But it's not always the case that the two are one and the same. An example comes from a competition years ago called the Netflix Prize, where Netflix put up a large prize, a million dollars or something like that, for whichever team could improve their recommendation algorithm by a certain amount. The team that won ended up using a model that predicted better over a model that explained better. They had two different models: one of them was better at explaining why people liked certain movies or television shows more than others, but it was not as good at making predictions as the one that couldn't be used to make those explanations. So in this boot camp we're going to focus on the making-predictions side of things,
because that's not touched upon as much in classical statistics courses. We'll leave it to you to learn about making inferences by referring back to more standard statistics texts and that sort of thing.

Okay. So before we move on to actually doing things with code, are there any questions about this idea of the supervised learning framework?

Okay. So with all of that being said, we're actually going to stay slightly abstract before we learn an actual algorithm, and I want to talk about a concept known as data splits for predictive modeling. Nope, wrong one, I want the lecture copy. Here we go. So there's this idea of splitting your data that you have to do in order to make predictive models, and that's what we're going to talk about here. We're going to talk about why you want to split your data set into smaller data sets, and then we're also going to talk about the three different types of data splits that you'll be making as you train models and try to find the best model.

Okay. For a lot of the notebooks, I think almost all of them, I'll be importing a series of packages at the very top. These are numpy, often pandas, matplotlib, and then from seaborn the set_style function, so I can add a white grid to make things easier to see. I'm going to use these in most of the notebooks to generate data and to plot data, so you'll always see a code chunk like this at the top; that's just because I'm going to be handling data a lot.

So I think a reasonable question is: why the heck would I want to split up my data set? I only have one of those, and I might want to use all of it. Let's imagine we're doing a predictive modeling project. We go out and randomly collect some data, say n observations of m features that have n corresponding outputs, (X, y). The goal with any predictive modeling project is to build a model that has the lowest generalization error. Here, generalization error is defined to be the error of the model on a new, randomly collected data set, meaning a data set it was not trained on. So if we fix a hypothetical new data set (X*, y*), then, since the data we collected originally was randomly collected, we can think of the generalization error of any particular model we're training as a random variable. Generalization error, again, meaning the error of the model on this new data set. When we say error, in regression it's going to be something called the mean squared error;
in classification it might be something like the accuracy. Either way, whatever it is, we're going to call this random variable capital G; that's the generalization error. The best model for predictive modeling purposes is the one that has the smallest capital G. So it would be nice if we could know something about capital G and its distribution. But if we use all of the data we collected out in the world to train our model, it's impossible for us to get an estimate of this capital G with the data we have in hand. Typically in the real world you collect as much data as you can, and then, either because of budget or logistical reasons, it's not practical to go out and collect additional data to then test your model on. So usually there's some sort of limitation on the data you're able to collect to train a model and then test that model's performance.

What you'll do instead is create data splits: one part will be set aside for training the model, and the other part of the split will be set aside for testing the model's performance, to simulate the process of going out, getting new data, and calculating this generalization error. That's the idea. So we're going to talk about three different splitting strategies that people use in data science and machine learning. But before we do that, does anybody have questions about our rationale for why we would want to do a data split?

Okay. So the first split type we're going to talk about is called the train test split. I know we just had this very long explanation about why we want to do a data split so we can estimate this G, and the train test split is sort of for that, but: when you make this split, you're going to be setting aside a small portion of the data, oftentimes 10, 15, 20, or 25% of your data, and then you don't touch that data until you've already selected a best model. The smaller part of your data set, that 10 to 25%, is called the test set, and the purpose of this set is to serve as a final stopgap test before you go out, take the model you've selected, and deploy it in the wild. What can happen (maybe not often, but it can happen) is that you could be working on a model, you think it's a really great model, but it turns out that maybe there was a typo in your code, or there was some data leakage from whatever process you were using, and so you're erroneously choosing not the best model.
This test set acts as a sanity check before you then, if you're working in industry, go out and potentially waste a lot of money and resources on a model that's not very good, whether because of a typo in your code or something like that. So the test set you set aside until the very end, after you've selected your model. The training set is then all the data that's left over, which you're going to use for the process of training your models and making model selections.

Here's an illustration of this. The data split is done randomly. This green rectangle represents all of the data we've collected through our sampling. Then we do a random sampling so that some portion, the larger portion of the data, gets set aside as the training set, which we're going to use to train and compare our models. And then over here the smaller portion, again 10, 15, 20, or 25%, gets held out until the final model is chosen. This is done randomly in general; we'll learn in a later notebook that there are sometimes problems where this randomness has to be relaxed, or the way we make our split has to be a little more prescribed than just random, but we'll come to that when we get to those notebooks.

So Kirthan's asking: you said you do not touch the test set until the best model is selected, but how do you select the model if you don't have the generalization error? These are great questions. The test set, and this is a potential point of confusion, is not used to directly estimate G. There are going to be other splits that are used for that. The test set is sort of a final check on your chosen model. So let's say you do an eighty-twenty split: that 20% is set aside until you've already selected a model using other methods, and then you use it as a sanity check, just to confirm that its performance is not wildly out of line with what you observed from your other forms of testing.

Aziz is asking: if we do data augmentation, should we do the splitting after that, or should we do data augmentation after we have the training set? It depends on the preprocessing you're doing. For instance, let's say you have a column that you want to apply a log transform to, so you just want to take the natural logarithm of that column; that you're able to do at the very beginning, before the split. But other things, like scaling and imputation, as well as preprocessing steps like PCA,
those you do have to do after you've made the train test split, because you cannot allow the test set to influence the things that need to be fit in those processes; that would be called data leakage. We'll talk about those more specifically once we get to those techniques. But that's a great question. Are there any other questions about the train test split?

Okay. So how do I make a train test split? One way you could do it is with either the random or the numpy.random packages, by hand, but that is tedious. So sklearn, which stands for scikit-learn, has a function called train_test_split, and that's what we're going to use. Let's imagine we have an imaginary data set where X is a thousand observations of 10 features, and y is a thousand observations of the output. Now we're going to use this function called train_test_split. In order to use this function we have to import it: from sklearn.model_selection, which is where train_test_split is stored (which we can see by looking at the documentation link I've provided here), we will import train_test_split.

Okay, so once we have that, we're going to run train_test_split. I've already started the code for that here, just to save myself time, but it might make more sense to do it like this. In train_test_split you first input your features, X, then you follow with your outputs, y. The next thing I usually do is set the argument called shuffle equal to True; shuffle just ensures that the data is randomly shuffled before the split is made. It might do this by default, but I always just like to make sure, because I want my split to be random. Then you can also specify the size of the split with an argument called test_size: you can either put in a number of rows, like 200, or you can put in a fraction, so if I put in 0.2 this will specify that 20% of the data set should be set aside as the test set. And then the last thing you might want to do is provide a random state. The random_state argument is a positive integer; I could make it anything I want, so maybe 4289. This integer sets the random state that is used to generate the split, and it ensures that if I use the random state 4289 and you use the random state 4289, then when we both run the code we'll get the same random split.
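Putting the call together, a minimal sketch of what was just described; the X and y here are random stand-ins for the imaginary 1000-by-10 data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in for the imaginary data set: 1000 observations of 10 features
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# 80/20 split, shuffled, reproducible because of the fixed random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, test_size=0.2, random_state=4289
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (800, 10) (200, 10) (800,) (200,)
```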
So the splits are made randomly, but by specifying the random state you ensure that every time you run the code you get the same split.

Let me show what this puts out. You can see that we get a list, and inside of that list are arrays: 1, 2, 3, 4 arrays. Those arrays are the training split of the features, followed by the test split of the features, followed by the training set of the outputs, followed by the test set of the outputs. So when I take this, copy it, paste it here, move these over so they're closer, and then run it, it will store all of these, and you can check the shapes. We did an 80-20 split, right? So X_train should have 800 rows, X_test should have 200 rows, y_train should have 800 observations, and y_test should have 200.

So Mitch is asking, does random_state have a default value? No. If you do not specify a random state, I believe it just uses your computer's internal clock at the time the code was run as the pseudo-random starting state, so it will be different every time. That's why you typically want to include a random state. And in practice you don't always want to use the same random state throughout all of your coding; maybe that leads to some weird behavior where your particular random state happens to provide better performance on the training sets, so try to switch it up every now and then.

Aziz is asking: if y is categorical, is there a quick way to guarantee that the train test split keeps the distributions of the classes approximately the same in the train and test sets? Yes, and we are going to learn that when we get to classification. It's called a stratified train test split, and there is an argument in train_test_split to do that.

Are there any other questions?

Yeah, I have a question. So when we are training our model, is it common practice to, instead of using just one train test split, do it multiple times and see whether your outcomes are consistent across all those train test splits, without a random state?

So I think people don't typically do that in the sense of making, let's say, 10 train sets and then training on each one separately. So, yes and no, if that makes sense. There is a thing we're going to learn in this notebook that's basically, in essence, the same as what you're saying:
00:25:29.000 --> 00:25:37.000 It's called a cross validation, and so it will take your training set split that into 10 sets, and do sort of the model fitting, and then you take an average we'll talk about that, and just a little bit. 00:25:37.000 --> 00:25:47.000 But like typically that is done on, it gets confusing, because that's done on the training set. 00:25:47.000 --> 00:25:53.000 And then the test set is still. You don't touch it until the very end. 00:25:53.000 --> 00:25:55.000 Yeah, okay, so I'll leave this exercise for you to try and complete like later tonight or tomorrow. 00:25:55.000 --> 00:26:05.000 So we're gonna talk about now that we have the trained test split settled. 00:26:05.000 --> 00:26:11.000 We're gonna talk about 2 types of splitting that are then used for model comparison and selection. 00:26:11.000 --> 00:26:21.000 So the 2 types. The first one is called a validation set, so a validation set is basically just a retread of the trained test split. 00:26:21.000 --> 00:26:26.000 But now you will configure your models. Performance on the validation set, and see how like which one compares better. 00:26:26.000 --> 00:26:39.000 So this is typically done when you have a model that takes a very long time to train, and you don't want to wait for the next thing, we'll learn about which is cross validation. 00:26:39.000 --> 00:26:48.000 Or if you have a data set where it's a very small data set and you don't have enough data to split it multiple times, we'll talk about that again before we end the notebook. 00:26:48.000 --> 00:26:52.000 So schematically. What this looks like is, you start off with a picture we had before, where the entire data set is split into a train set, and then a test set. 00:26:52.000 --> 00:26:59.000 And then that training set is further randomly split into a validation set, and then a smaller training set. 00:26:59.000 --> 00:27:09.000 So this smaller training set is the one that's then used to train the models. 00:27:09.000 --> 00:27:14.000 And then this validation set is used for the models, and then this validation set is used for the estimate. 00:27:14.000 --> 00:27:32.000 The estimating of the capital. G. So like I might train 4 or 5 different models here, calculate the error on the validation set, and then find the one that has the best error so in practice, this is done exactly the same way with train test split but instead of X comma Y it 00:27:32.000 --> 00:27:39.000 would be X underscore trained? Y underscore train, and then you just provide all the remaining arguments. 00:27:39.000 --> 00:27:45.000 So shuffle equals true test size equals point. 00:27:45.000 --> 00:27:52.000 Let's say point 2, and then random States is equal to. 00:27:52.000 --> 00:28:13.000 Let's do 2, 3, 2. Okay, so are there questions about the validation split? 00:28:13.000 --> 00:28:14.000 Sorry I've got. 00:28:14.000 --> 00:28:20.000 Sorry go ahead. 00:28:20.000 --> 00:28:30.000 Okay. So I was just gonna ask, like, it might happen that you have minimized G over all your validation over your validation set or cross validation settings. 00:28:30.000 --> 00:28:38.000 When you like doing. 00:28:38.000 --> 00:28:43.000 I said the address in your testing set is still like high. 00:28:43.000 --> 00:28:56.000 So what do you? 00:28:56.000 --> 00:29:03.000 And do like repeat the process by changing your you know what I mean. Like, the yeah. 00:29:03.000 --> 00:29:06.000 Like, how would you do that when? 00:29:06.000 --> 00:29:20.000 Yeah, yeah, so that can happen. 
So basically, if you have a model that, like I said earlier, takes a really, really long time to train (sometimes you'll have models that take a whole evening to train), and those are the models you're comparing, you'll typically just go with the one that performs best on the validation set, as long as the test set final check didn't reveal anything wrong with the model. Meaning you didn't have a typo, and it isn't performing unexpectedly, something like that.

So I get it. Even though it increases the overall error on the test set a little bit more than on the validation set?

Yeah. Because these are different sets, the errors will be different, and so it could possibly be the case that you perform worse on your test set than on your validation set. Unless there's a huge difference that you can't explain by looking at the data and saying, okay, my test set had this one outlier and that's why it's performing way worse...

You would still go with that model.

Yup! And there was another question, or did you have the same question?

Yeah, you've got it addressed. Thank you.

Okay, yup. Okay. So the other process that you'll do, which is ideal (if you can do this, you would prefer to do this), is K-fold cross-validation. If you've taken statistics courses, you might have seen a variation of this called leave-one-out validation, or something like that, where you go through and train on everything except for one observation, which is left out. This is the same sort of process, and if you haven't heard of that, you're going to learn about cross-validation now. So the validation set approach, in essence, coming from a frequentist statistics point of view, gives you a point estimate of G. Instead of having a single estimate of G, where maybe model one is just randomly better than model two on this particular set even though model two is better on average, it would be ideal to know something about the distribution of G, as opposed to just a single estimate. Now, it's difficult to get an idea of the distribution of G,
but what we can do is try to leverage a rule from probability called the law of large numbers. As a reminder of what that says: if you have a sequence of independent, identically distributed random variables with some true mean mu, then the arithmetic mean of them, meaning you add them all up and divide by the total number of them, will approach mu in the limit as N goes to infinity. Basically, what this is saying is that you can use the arithmetic mean to estimate the expected value of the distribution. So you can use a sequence of estimates of this generalization error (we'll talk about how you can get that in a second) to try to estimate the expected value of the error. And that's the idea here with K-fold cross-validation: we're going to try to create a sequence of estimates that we then average together.

So how do we do this? You take your training set and randomly split it into K different smaller subsets; the K in this example is 5. Then, after you do that split, you sequentially go through: train on 4 of the 5 (K minus 1 of the K), and then get the error on the one that's left out. Then you sequentially go through and change which one is left out. That will give you, in this case, 5 estimates of that error, which you can then average together to try to get an estimate of G. Common values of K are 5 and 10, but you could use different values if you'd like. The idea is that by doing this process we can imagine we're getting random draws of the generalization error as a random variable, which we can then use, via the law of large numbers, to get some estimate of the expected value. It's not a perfect estimate, because 5 is not a large sample, but we're limited by the size of our data and how long model training takes.
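In symbols (a light formalization of the argument above; the notation G_i and the hat-G for the cross-validation estimate is mine, not the notebook's):

```latex
% Law of large numbers: for i.i.d. random variables G_1, G_2, \dots with mean \mu,
%   \frac{1}{k}\sum_{i=1}^{k} G_i \to \mu \quad \text{as } k \to \infty.
% K-fold CV treats the k held-out errors as approximate draws of G and averages them:
\hat{G}_{\mathrm{CV}} \;=\; \frac{1}{k}\sum_{i=1}^{k} G_i,
\qquad G_i = \text{error on fold } i \text{ for a model trained on the other } k-1 \text{ folds}.
```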
Okay. So before we talk about how to implement this, are there any questions about it?

Quick question: is this preferred over something like bootstrapping, where you're sampling with replacement from the set?

So bootstrapping is sort of a different thing, and Jason is asking in the chat whether the left-out validation sets can overlap, so I'll combine the two questions. With bootstrapping the samples would overlap, but in cross-validation they don't, and the idea there is to prevent leakage from your left-out set into the training set across the different estimates. So here your left-out set is guaranteed to be different from estimate to estimate.

Okay, so this is a more robust estimate of the error than bootstrapping, which also gives an estimate of error?

Yeah. I just know that, in practice, this is what's done. I think bootstrapping will sometimes be done for estimating parameters, as opposed to estimating the error of the model.

Ashley's asking, is each of these columns of the data the same data? Yes. This is the same training set: each column in the figure is the training set, just broken down into the splits, and it's demonstrating that each time you go through, you leave out a different one of the splits.

Brantley is asking: to apply the law of large numbers we need G_1 through G_5 to be independent samples from the true distribution of G, and these seem like they would be dependent if we generate them like this. So, it's not a perfect one-to-one with going out and getting new samples; we can't do that. We're limited: we went out and got our observations, the green "all data" bar, and we are unable to go out and get additional observations. We don't live in the ideal world where we can just conduct experiments until we have enough data to make estimates. So with the restrictions we have from our initial sampling, this is our best approximation of i.i.d. random variables that we can then use to get an estimate of the expected value. This is not a perfect repetition of the statistical process we may have learned in stats class; it's trying to do our best to replicate that, to get an idea of what we might expect the generalization error to be.

Question: you mentioned knowing something about the distribution of the generalization error, as opposed to just having a single estimate for it. I don't know whether it's ever feasible to get anything about the distribution other than the mean, but if so, would you ever care about it? Like, normal versus something else; does that factor into model selection?
So, you could compare the distributions. Let's say hypothetically you could get an estimate of the distribution; you could compare the two and see which you'd want. The cross-validation estimate is kind of like a mean, but a distribution is not defined by its mean, right? If you had a bimodal distribution, the mean would be in the middle, but you'd never actually observe the mean. So basically, you could compare the two distributions and then you'd want to choose the one that has a higher probability of giving the better error.

Ricky's asking, if your training set is really large, would you benefit from a larger K value? I think, as long as you can guarantee that your holdout sets have enough observations to be a good measure of the error, then yes, a larger K. But again, it's not just about that; you also have to consider the computation time. The larger K is, the more times you have to fit the model and then calculate the error, and a lot of the time the models you choose may take a while to train. So that's another consideration. From what I understand, in practice you tend to stick to 5 or 10. You can try different values if you'd like, but if you let K get too large you're spending a lot of time training your model.

I have a question. So with cross-validation, there's a pretty good chance that the mean error here will be much higher than what you would find on your training set after you select the model. Is that a common problem, and if that's the case, how would you go about dealing with it? Is K-fold cross-validation really effective in that case, or is it better to move on to a different way of splitting the data?

Yeah. So in predictive modeling we don't really care about the performance on the training set, because we already know the labels for those observations. What we want is to produce a model that is good at guessing, or predicting, what the labels would be on data we don't know. In this case we're pretending we don't know the left-out sets (we do know them), but the point is we want a model that's good at predicting the labels for things we don't know ahead of time. We could have a model that's perfect on the training set, but if it's unable to make good predictions on things it doesn't know the label for, it's useless to us. So it's typically going to be the case that your models perform better on the training set than on these holdout sets, or on the validation set, but that's fine.
The point is to find the model that performs the best on these left-out sets, or on the validation set if that's the approach you take, because that's your goal with predictive modeling: to make the best predictions. With predictions we're assuming we don't know what the labels are; here we're just taking advantage of the fact that we do have the labels to get a sense of how good our predictions are.

Thank you.

Yeah. Okay, so for the sake of time I'll cut off questions there, so we have enough time to finish the other two notebooks.

So how do we do K-fold cross-validation? sklearn has an object for that: in model_selection there's the KFold class. So we'll do, from sklearn.model_selection, import KFold, capital K, capital F. The way this works is that you first create what's known as the KFold object. You put in the number of splits, n_splits, and here we'll do 5. The next thing you want to do is say that you do want it to randomly shuffle the data, so shuffle equals True, and, just like with train_test_split, you can specify the random state, so why don't we do 758. So now we have... oh, what did I do? Oh, n_splits, not splits. There we go. Okay, so now that we have that, when we want to go ahead and make the 5-fold cross-validation split, you'll call kfold.split, and then you put in the features first, followed by the outputs. Here I'm going to demonstrate what you get: this is what's known as a generator object, and in Python those have to be iterated through with something like a for loop. So you write a for loop, and what comes out are indices, indices of either the array or the DataFrame, whatever you put in; you'll do train_index first, because the train index comes first, followed by test_index, for each split. I'm going to run this and you'll see what you get: first you get a list of the indices of the training set, and then you get a list of the indices of the test set. We'll focus on just showing that the splits are in fact different from one another: you've got 6, 7, 9 in the test set here, and then, if you look here, you can see that 6, 7, and 9 are now in the training set. So this will give you the train and test indices for all 5 splits.
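A minimal sketch of the KFold usage just described, carried through the fit-and-record loop that the lecture turns to next; the LinearRegression model and mean-squared-error scoring inside the loop are illustrative choices, not prescribed by the lecture:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(1000, 10)   # stand-in data
y = np.random.rand(1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=758)

holdout_errors = []
for train_index, test_index in kfold.split(X, y):
    # subset to get the training fold and the held-out fold by index
    X_tt, X_ho = X[train_index], X[test_index]
    y_tt, y_ho = y[train_index], y[test_index]

    # fit on the k-1 training folds, record the error on the held-out fold
    model = LinearRegression()
    model.fit(X_tt, y_tt)
    holdout_errors.append(mean_squared_error(y_ho, model.predict(X_ho)))

print(np.mean(holdout_errors))  # cross-validation estimate of the expected error
```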
In practice, then, what you would do is: each time through the loop, you subset to get the training set for that split and the holdout set for that split (up in the graphic above, those are the pieces labeled "leave out"), and then you fit the model and record the error on the holdout set in some kind of array that you can look at later.

Okay. So Jacob has asked, how is model_selection.KFold different from model_selection.cross_validate? I've only ever used KFold, so I don't know the difference offhand, but if you go to the documentation links you can figure it out; they will tell you what each one does and what each one returns. Just to demonstrate what the documentation links look like: this is the documentation from sklearn for KFold. It tells you it's a K-fold cross-validator that provides train/test indices to split the data. We can compare this to cross_validate: if I just replace KFold with cross_validate, this one says it evaluates metrics by cross-validation and also records fit and score times. So this is something where you'd have to provide the algorithm, the data, and how you want to score. It basically does the whole cross-validation process for you, whereas KFold just provides the splits of the data, and then, in the for loop that I wrote out below, you have to fit the model and record the holdout error yourself.

So why would we want to do one versus the other, a validation set versus cross-validation? The main considerations when you're choosing between the two are data set size and model training time. In general, you want to use cross-validation when it's feasible, because you get a better sense of the errors than from just a single validation set. In practice, if your data set is small (and it's hard to say what small means; I don't have a number like a hundred or ten, it really depends on the problem you're doing and the type of model you're trying to fit), or if your model takes a very long time to train, you'll use a validation set, because that's just what you're able to do. If you have a large data set and your models are relatively quick to train, where again there's not a good rule of thumb and it just depends on the problem you're working on, then you'd use cross-validation.
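For reference, a minimal sketch of the cross_validate alternative discussed above, which wraps the whole split-fit-score loop for you; the LinearRegression estimator and the negative-MSE scoring string are illustrative assumptions, not what the notebook uses:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

X = np.random.rand(1000, 10)   # stand-in data, as in the earlier sketch
y = np.random.rand(1000)

# cross_validate handles the splitting, fitting, and scoring in one call;
# sklearn reports MSE as a negative score so that "higher is better" everywhere
results = cross_validate(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")

print(-results["test_score"].mean())  # average holdout MSE across the 5 folds
```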
Okay, so just for the sake of time, I'll pause questions and move on; you'll have a chance to practice with cross-validation in the problem session tomorrow. For now, let's dive in and learn an actual algorithm.

We're going to start with simple linear regression, which I'm sure is familiar to a lot of you, but it's a good place to start, because over the course of the next two lectures we'll build upon it to get more complicated models, and it's also going to let us apply this cross-validation approach in the next lecture notebook.

Okay. So the simple linear regression model is for this situation: you have a variable you want to predict, called y, and then you have a single feature that I'm going to call little x here, just because that's what I'm used to. Remember our supervised learning framework, that y is equal to f(x) plus some error. In this case the f(x) is an actual function that we're assuming a form for, which is beta_0 plus beta_1 times x, and then plus epsilon, which is our error. beta_0 and beta_1 are real-number constants known as parameters, which we're going to estimate. And then for simple linear regression we have the assumption that the epsilons are normally distributed with mean 0 and a common standard deviation, and that the error term is independent of x.

To visualize what this model looks like: if we have this blue line representing y equals beta_0 plus beta_1 x, the systematic form, then the epsilons for each value of x are being drawn from the same normal distribution and then added to the value of y given by the line. You can imagine that for any given value of x we go to the line, draw a random error, and add it to that value; so it's less likely that we'd be way up here or way down here, and more likely that we'd be closer to the line. This is what the model is assuming, and then you collect your data and we'll see whether it's a good fit. We make these assumptions because, if the assumptions hold, we can derive some nice features of the estimates and the predictions. Some of them are touched on in later lecture notebooks, and some of them are touched on in the practice problems for regression; they allow you to derive nice properties of the estimators.

So Melanie asked... okay, Melanie, I'm not sure why you're unable to see my screen; it says that I'm screen sharing on my end, and everybody else can see it. Okay, great, awesome.

Okay, so how do you fit this model? In general we're going to use Python to fit the models, but I also think it's useful to know how the algorithms are fit. There are a couple of reasons for this. One is that it's nice to get away from the black-box
idea of machine learning, where data goes in, prediction comes out, and you don't know what's happening in the middle. A lot of times things can go wrong with your model, and it's useful to have an idea of what's going on behind the scenes, just like it's useful to have an idea of how your car works if something breaks and you want to be able to fix it yourself. That being said, you don't always have to know how it works. There are plenty of positions where your bosses don't care whether you know how the thing is being fit in the background, as long as you know enough to make the business money. So if you're a person who just wants to figure out how to do it in Python, we're going to cover how to do that, and if you're a person who wants to know how the algorithms work, we're also going to try our best to cover that. Just feel free to pay attention to the parts you're most interested in and ask questions about those.

Okay, so how do we fit the model? The way we fit a model is by defining what's known as a loss, or error, function. The things we're estimating are beta_0 and beta_1, and to estimate them we need a loss function. For regression problems, which are problems where the outputs are continuous, like the one we have here, our loss function is the mean squared error, or MSE. The mean squared error is given by one over n times the sum from i equals 1 to n (remember, n is the number of observations) of the actual value minus the estimated value, which is what the little hat denotes, squared. So that's the "squared" part, you're taking the square of the difference, and the "mean" part is the one over n of that sum; a mean is an average value. Plugging this in for simple linear regression, our estimate is beta_0 hat plus beta_1 hat times x_i, so that's where this part comes from. If you do a little bit of calculus and then some algebra, you can find that the values of beta_0 hat and beta_1 hat that minimize this MSE (we want our errors to be small, so we want the values that make this as small as possible) are given by: beta_0 hat is the average value of y minus beta_1 hat times the average value of x, where these averages are found using the data you've observed, the training set; and beta_1 hat is given by the sample covariance of x and y divided by the sample variance of x.
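Written out, the formulas just described (reconstructed in LaTeX from the spoken description):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
             = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2,
\qquad
\hat{\beta}_1 = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
```

with the sample covariance, sample variance, and averages all computed on the training set.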
00:52:56.000 --> 00:52:57.000 Okay, so MSE is used as the default loss function for a number of reasons. 00:52:57.000 --> 00:53:06.000 A lot of those reasons come from its roots as a statistical regression technique. 00:53:06.000 --> 00:53:18.000 A nice reason to use it is that this function is differentiable with respect to the beta hats; it's also convex, meaning that if you're able to find the minimum, it is a unique minimum. 00:53:18.000 --> 00:53:31.000 Other things you might use: some people might want you to use the mean absolute error, or MAE, and if you're interested in learning more about that, you can check out the regression practice problems 00:53:31.000 --> 00:53:35.000 notebook. Okay, so before we show you how to do this in sk 00:53:35.000 --> 00:53:36.000 learn, does anybody have a question about the model, or how to fit it? 00:53:36.000 --> 00:53:54.000 So what would you do if you had errors in your X and y data? 00:53:54.000 --> 00:54:11.000 Yeah, so in practice you might have, you know, some sort of error with recording your X, but in the model you're just assuming that your X's are what they say they are. So, for instance, if you had data that was related to 00:54:11.000 --> 00:54:15.000 the height and the weight of somebody, and you're using those as features, 00:54:15.000 --> 00:54:20.000 you're just assuming that those are correct. That's just an assumption of the model. 00:54:20.000 --> 00:54:27.000 Yeah, I don't know how else to say it: it's just an assumption of the model. 00:54:27.000 --> 00:54:39.000 You would not have a situation with linear regression, the way it's classically set up, where you allow for the inputs to also have errors. 00:54:39.000 --> 00:54:40.000 Yup! 00:54:40.000 --> 00:54:41.000 Thanks. 00:54:41.000 --> 00:54:42.000 I think, in other words, that's basically saying you have bad data, and you're trying to build a model on data that's very much error-prone. 00:54:42.000 --> 00:54:43.000 So I think it's best to have data that's right, 00:54:43.000 --> 00:54:44.000 and yes, there will be some inherent error. 00:54:44.000 --> 00:55:14.000 But again, going back to the purpose of machine learning models: it is to actually make predictions without you going back to your experiment, or whatever it is, to generate that data. 00:55:25.000 --> 00:55:26.000 Yeah, so that's one way to think about it. 00:55:26.000 --> 00:55:47.000 My high school mathematics teacher, whenever we were using our calculator to solve a problem, would always encourage us to double-check the inputs of our calculations, saying "GIGO," which stood for garbage in, garbage out. So, like they said, if you had 00:55:47.000 --> 00:55:58.000 bad input data from, you know, faulty measurements, then your model is also not going to be very good. 00:55:58.000 --> 00:56:06.000 Any other questions? 00:56:06.000 --> 00:56:17.000 Okay. So, because these estimates are found with sample means, a sample covariance, and a sample variance, 00:56:17.000 --> 00:56:21.000 you could calculate this by hand using numpy or pandas. 00:56:21.000 --> 00:56:25.000 But we're just going to get into the swing of using sklearn. 00:56:25.000 --> 00:56:39.000 So sklearn is sort of the workhorse of traditional machine learning algorithms, and by that I mean the non-deep-learning stuff, so it's sort of the workhorse in Python for doing this sort of thing.
00:56:39.000 --> 00:56:53.000 So they have what are known as model objects. For almost all of the algorithms we learn, there's going to be a model object that will take in the data, fit to find whatever parameters it needs to fit, 00:56:53.000 --> 00:57:04.000 and then allow you to make predictions with that. So we're going to learn that workflow in this notebook; in particular, we're learning how to use the linear regression model object right now. 00:57:04.000 --> 00:57:10.000 So we're gonna use it to predict on this synthetic data. 00:57:10.000 --> 00:57:13.000 So this is synthetic because I used numpy to generate it. 00:57:13.000 --> 00:57:14.000 So it's random data, but it's not real data. 00:57:14.000 --> 00:57:25.000 So we have x, which is randomly distributed uniformly from 0 to 1; it's 100 observations. And then y, where 00:57:25.000 --> 00:57:26.000 the true relationship is 2x plus 1, and our random noise is 00:57:26.000 --> 00:57:36.000 normally distributed with a standard deviation of 0.5. 00:57:36.000 --> 00:57:37.000 Okay, so this is what that looks like. These are our observations. 00:57:37.000 --> 00:57:43.000 And now we're going to use sklearn to fit 00:57:43.000 --> 00:57:49.000 a simple linear regression model of y regressed onto x. So the first thing you're gonna do in these workflows is you'll import the model class. 00:57:49.000 --> 00:58:01.000 So from sklearn, linear regression is stored in linear_model. 00:58:01.000 --> 00:58:09.000 We'll import LinearRegression, with a capital L and a capital R. 00:58:09.000 --> 00:58:17.000 And so this might be a good time to pause and say: in Python the syntax standard is that when you have a class, you'll use what's known as camel case. 00:58:17.000 --> 00:58:21.000 I believe it's called camel case typing. 00:58:21.000 --> 00:58:24.000 So each new word 00:58:24.000 --> 00:58:42.000 starts with a capital letter, then all of the remaining letters are lowercase, and when you have a new word, instead of an underscore, it starts with another capital. This is for classes and objects, whereas other things like functions tend to be separated with underscores. So this is 00:58:42.000 --> 00:58:47.000 just a note on Python syntax for those of you that are new to Python. 00:58:47.000 --> 00:58:52.000 Okay, so after we've imported our LinearRegression class, we can now make an empty model object. 00:58:52.000 --> 00:59:00.000 So let's do: slr is gonna be the variable 00:59:00.000 --> 00:59:07.000 I store it in, then I'm gonna do LinearRegression, parentheses. 00:59:07.000 --> 00:59:13.000 So some of these models will have inputs that you can use to customize the model. 00:59:13.000 --> 00:59:16.000 I'm gonna use the standard model, the default. 00:59:16.000 --> 00:59:22.000 One thing that might be worth doing is putting in the argument copy_X equals True. 00:59:22.000 --> 00:59:38.000 What this ensures is that when LinearRegression takes in our X and our y, it will make copies of the arrays before fitting the model. 00:59:38.000 --> 00:59:43.000 And so in Python, with these sorts of things, you wanna make sure that you make copies, 00:59:43.000 --> 00:59:52.000 so you're not accidentally altering the original data, which can happen with the way that Python stores data in your computer.
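(A minimal sketch of that setup. The data-generating line follows the description above, x uniform on [0, 1] and y equal to 2x plus 1 plus normal noise with standard deviation 0.5, but the seed and the exact code are mine, not the notebook's.)

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(440)

    # synthetic data as described: 100 observations, true relationship y = 2x + 1, noise sd 0.5
    x = rng.uniform(0, 1, 100)
    y = 2 * x + 1 + rng.normal(0, 0.5, 100)

    # an empty, not-yet-fitted model object; copy_X=True tells sklearn to copy the inputs before fitting
    slr = LinearRegression(copy_X=True)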
00:59:52.000 --> 00:59:53.000 So just to be safe, I usually will put the copy_X 00:59:53.000 --> 00:59:54.000 argument equal to True. Okay? So now I have a linear regression model. Yup? 00:59:54.000 --> 01:00:07.000 Why only the copy_X? What about copy_y? 01:00:07.000 --> 01:00:08.000 So this is just what the argument is called. 01:00:08.000 --> 01:00:16.000 In the algorithm, I believe the X is the one that's manipulated, whereas y, I think, just gets to be y. 01:00:16.000 --> 01:00:17.000 That might be why they use copy_X instead of copy_y; I don't think there is a copy_y argument. 01:00:17.000 --> 01:00:29.000 Okay. Thank you. 01:00:29.000 --> 01:00:30.000 So now that we have the model object, we could look at it. 01:00:30.000 --> 01:00:38.000 This is what it looks like. It's not fit yet. 01:00:38.000 --> 01:00:39.000 And you might not be able to see this; when you do this, you might just see the text. 01:00:39.000 --> 01:00:40.000 It might just depend on your version of Jupyter Notebooks. 01:00:40.000 --> 01:00:43.000 So once we have the empty model object, we can fit it. So we do 01:00:43.000 --> 01:00:56.000 slr.fit. So here's gonna be our first instance of something that I think really tends to confuse people. 01:00:56.000 --> 01:01:03.000 In order to fit your models, your features have to be what's known as a 2D array. 01:01:03.000 --> 01:01:14.000 So if we look at X right now and do X.shape, it's a 1D numpy array, meaning that it has a single direction; it's just one-dimensional. 01:01:14.000 --> 01:01:23.000 So right now, mathematically, we could think of it as a row vector, but what we need it to be is a column vector, or two-dimensional. 01:01:23.000 --> 01:01:27.000 So what we're gonna do is what's known as a reshape. So we do 01:01:27.000 --> 01:01:39.000 X.reshape(-1, 1). And so what this does: if we look at the original X, we can kind of see it's like a row. 01:01:39.000 --> 01:01:47.000 But once we've done the reshape, this now makes it a column, and if we look at the shape of that, X.reshape(-1, 1).shape, 01:01:47.000 --> 01:01:55.000 it is now a two-dimensional 01:01:55.000 --> 01:02:00.000 array. It has 100 rows and one column. 01:02:00.000 --> 01:02:02.000 So what reshape does is it allows you to input arguments that will dictate the shape of the array. 01:02:02.000 --> 01:02:16.000 The one in the second position tells numpy that I want it to have a single column, and then the negative one here says: make this whatever dimension you need it to be to fill in the array. 01:02:16.000 --> 01:02:38.000 So I could have replaced this with 100 and it still would have worked, but in general we don't know how many observations we're going to have, so it's better practice to use a negative one, because no matter what shape our X is, this will still work. Okay, why did we need to 01:02:38.000 --> 01:02:41.000 go through that big, long spiel about doing reshape? 01:02:41.000 --> 01:02:43.000 Well, this is what 01:02:43.000 --> 01:02:45.000 would happen if we did X comma y: we get an error. 01:02:45.000 --> 01:02:51.000 And why do we get this error? We can scroll all the way down, and we can see that it says you got this 01:02:51.000 --> 01:02:58.000 error because it expected a 2D array but got a 1D array instead.
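(A sketch of that reshape-and-fit step, continuing from the x, y, and slr in the sketch above:)

    print(x.shape)          # (100,): a 1D array, which sklearn rejects as a feature matrix
    X = x.reshape(-1, 1)    # the 1 means "one column"; the -1 means "however many rows are needed"
    print(X.shape)          # (100, 1): a 2D array with a single feature column

    slr.fit(X, y)           # fitting finds the intercept and slope estimates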
01:02:58.000 --> 01:03:01.000 So, the way that sklearn writes its algorithms to fit models in the background, it's assuming that the features are stored in a 2D array. 01:03:03.000 --> 01:03:04.000 That's why we have to do the reshape(-1, 1). 01:03:04.000 --> 01:03:05.000 Okay. So now we have a fitted linear regression. And once our linear regression is fit, we can do 01:03:06.000 --> 01:03:20.000 slr.predict. I'm just going to go ahead and copy and paste this, and see that these are the predictions of 01:03:20.000 --> 01:03:25.000 the linear regression model on the data we used to train it. 01:03:25.000 --> 01:03:32.000 Okay. So maybe before going into looking at all the different parts of the simple linear regression model, are there any 01:03:33.000 --> 01:03:39.000 other questions? 01:03:39.000 --> 01:03:41.000 Okay, so here's one: can you explain one more time the reason to do reshaping? 01:03:41.000 --> 01:03:49.000 So, the simple linear regression model 01:03:58.000 --> 01:04:04.000 could not be fit with a one-dimensional array. So let's go back to what this looks like. 01:04:04.000 --> 01:04:13.000 So, putting that comma back: remember, X on its own is one-dimensional. 01:04:13.000 --> 01:04:19.000 If we go down to the error message, it will say, 01:04:19.000 --> 01:04:37.000 and even here it's a really good error message, because it says: reshape your data either using reshape(-1, 1) if your data has a single feature, or reshape(1, -1) if it contains a single sample. Because ours just has a single feature, that's why we use the 01:04:37.000 --> 01:04:40.000 first one. So it has to be a 2D array. 01:04:40.000 --> 01:04:43.000 So we do reshape(-1, 1). There are 2 entries here, which means it will be two-dimensional. 01:04:43.000 --> 01:04:55.000 We know that X is a single column, so we put a one in the second spot. 01:04:55.000 --> 01:04:59.000 In general we might not know the number of rows our data has, 01:04:59.000 --> 01:05:06.000 so then we would use a negative one instead. 01:05:06.000 --> 01:05:10.000 So there's a question asking: wouldn't a simple transpose function work? 01:05:10.000 --> 01:05:15.000 So if we did X.transpose(), we can check this out, 01:05:15.000 --> 01:05:23.000 and you can see that the shape of this is the same. 01:05:23.000 --> 01:05:26.000 Okay, so we have to do reshape. So we've got some more questions. 01:05:26.000 --> 01:05:27.000 The next question is, why didn't I do an X, 01:05:27.000 --> 01:05:39.000 y train test split? So, one, I know it's like, oh, I just went through this big long notebook of why I do splits. 01:05:39.000 --> 01:05:47.000 This is just to demonstrate the model, so I'm not trying to make any predictions. 01:05:47.000 --> 01:05:50.000 I'm just trying to show you: this is the model, this is how it works, 01:05:50.000 --> 01:06:04.000 this is how you fit it with Python, this is how you fit it in general. If this was a predictive modeling problem, I would make my train test split at the very beginning and then go from there. For the simplicity of not having to 01:06:04.000 --> 01:06:09.000 go through and make those steps, I just showed it to you with the data.
01:06:09.000 --> 01:06:20.000 Another reason is that this is a situation where, if I wanted to, I could just go generate more data at any time I want, because this is synthetic data. So anytime I want, 01:06:20.000 --> 01:06:24.000 I can rerun the random generation and generate X and y all over again. 01:06:24.000 --> 01:06:28.000 Okay. So then Payelle is asking: what about y? 01:06:28.000 --> 01:06:34.000 Why don't we reshape that? This is just the way that sklearn has written its code. It 01:06:34.000 --> 01:06:40.000 does not expect y to be a two-dimensional vector or two-dimensional numpy array. It's fine, and I think it's better, to leave it as a one-dimensional array. 01:06:40.000 --> 01:06:43.000 So you do not have to reshape y; you do have to reshape X. 01:06:43.000 --> 01:06:56.000 Basically, when we look at multiple linear regression, you'll see that X is a matrix. 01:06:56.000 --> 01:07:01.000 So basically, the people who wrote sklearn, I think, are expecting your features to be a matrix 01:07:01.000 --> 01:07:02.000 and your y to be like a regular row vector, so I think that's why it's like that. 01:07:02.000 --> 01:07:13.000 But in general your y does not have to be a two-dimensional array for sklearn; 01:07:13.000 --> 01:07:21.000 it can be a one-dimensional array. 01:07:21.000 --> 01:07:32.000 Okay. Any other questions? 01:07:32.000 --> 01:07:39.000 Okay, so this is a regression model, which means, remember, we said that it has a beta 0 01:07:39.000 --> 01:07:44.000 we are trying to estimate and a beta 1 we are trying to estimate, so we can get all of that data. 01:07:44.000 --> 01:07:48.000 To get the intercept, which is the estimate of beta 0, 01:07:48.000 --> 01:08:00.000 you just do slr, the name of the variable, dot intercept_, and so here we're estimating the intercept to be 0.9975. 01:08:00.000 --> 01:08:02.000 You can get the estimate of beta 1 with 01:08:02.000 --> 01:08:16.000 slr.coef_. So here we're estimating that the coefficient is 2.15, and then here we can use this to actually predict, to show what the model is saying. 01:08:16.000 --> 01:08:21.000 So the black line here is the model that I just fit with simple linear regression. 01:08:21.000 --> 01:08:30.000 I'm just providing an evenly spaced array from 0 to 1 and then predicting on that to get the y values. Okay? 01:08:30.000 --> 01:08:32.000 So this is the model that I fit 01:08:32.000 --> 01:08:38.000 just now in this notebook. Okay? 01:08:38.000 --> 01:08:39.000 Yeah, so Chris is asking: sorry, what would the coefficient show with 01:08:39.000 --> 01:08:40.000 multiple betas, if they were in the model? 01:08:40.000 --> 01:08:41.000 So when we learn about multiple linear regression, we'll see that 01:08:41.000 --> 01:08:42.000 if you're doing multiple linear 01:08:42.000 --> 01:08:47.000 regression, coef_ will hold all of the coefficients from the model, and intercept_ will still hold the intercept. 01:08:47.000 --> 01:09:01.000 Any other questions about this notebook before we move on to our last notebook for today? 01:09:01.000 --> 01:09:08.000 Okay. 01:09:08.000 --> 01:09:09.000 Alright!
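(Before moving on, a minimal sketch tying together the predict, intercept_, and coef_ steps from this notebook, continuing from the slr and x above. The printed values depend on the random draw, so the 0.9975 and 2.15 quoted in the lecture are just one outcome.)

    import numpy as np

    preds = slr.predict(x.reshape(-1, 1))   # predictions on the data used to fit the model

    print(slr.intercept_)   # the estimate of beta_0
    print(slr.coef_)        # array of slope estimates; length 1 for simple linear regression

    # an evenly spaced grid from 0 to 1, used to draw the fitted line
    grid = np.linspace(0, 1, 100).reshape(-1, 1)
    line_values = slr.predict(grid)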
01:09:09.000 --> 01:09:12.000 So the last notebook we're gonna do is sort of give 01:09:12.000 --> 01:09:18.000 you a rundown of how a predictive modeling workflow might go. 01:09:18.000 --> 01:09:19.000 It's not exactly how it will be once 01:09:19.000 --> 01:09:43.000 you hopefully get your dream job. 01:09:43.000 --> 01:09:44.000 It's just hard to imagine now, because we're in the middle of the baseball season, 01:09:44.000 --> 01:09:47.000 but let's imagine that we're in November and it's the off-season. 01:09:47.000 --> 01:10:08.000 During the off-season, baseball teams are looking at players that they can bring in or keep to improve on the number of wins they had this past season going into the coming season. So one question you might have, as someone who's working for a baseball team, is: alright, is it better to have better 01:10:08.000 --> 01:10:15.000 defensive players, which in the sport of baseball means that you're limiting the number of runs that your team is allowing, 01:10:15.000 --> 01:10:22.000 or is it better to have good offensive players, meaning that you're increasing the number of runs that your team scores? 01:10:22.000 --> 01:10:26.000 So basically, what you're trying to see in this question is: given the number of runs 01:10:26.000 --> 01:10:34.000 and the number of runs allowed, which is better at predicting the number of wins you will have in a given season? 01:10:34.000 --> 01:10:40.000 Now, this is a silly question, because it isn't realistic; the real world of baseball is much more complicated. 01:10:40.000 --> 01:10:46.000 But we only know simple linear regression right now, so this is a perfect question for us. 01:10:46.000 --> 01:10:51.000 So the first thing we're gonna do is load the data. 01:10:51.000 --> 01:10:52.000 And then here's a random sample of that data, so we can see what it looks like. 01:10:52.000 --> 01:11:00.000 So here we have 5 rows of the data. We have teams, 01:11:00.000 --> 01:11:05.000 we have the year that the data comes from, the league of the team, 01:11:05.000 --> 01:11:08.000 so in Major League Baseball we have the National League and the American League, 01:11:08.000 --> 01:11:23.000 the number of games that team played in that season, the number of wins and losses that team had during the season, and then the number of runs scored by that team and the number of runs allowed by that team. If you're unfamiliar with baseball, runs 01:11:23.000 --> 01:11:29.000 allowed means the total number of runs, or points, that the other teams scored against them. 01:11:29.000 --> 01:11:42.000 So in 2012 the Pittsburgh Pirates scored 651 runs, and teams scored 674 runs against them. 01:11:42.000 --> 01:11:49.000 Okay. So once you get your data, the very first thing you should do, assuming you're not doing any sort of data transformation like a log transform or anything, is the train test split. 01:11:49.000 --> 01:12:00.000 So I'm going to go ahead and import my train test split, which we saw earlier today. 01:12:00.000 --> 01:12:01.000 Then I'm going to make my train test split. 01:12:01.000 --> 01:12:14.000 You should notice here that I have baseball.copy(). The .copy() here is making, I think it's called a hard copy, 01:12:14.000 --> 01:12:31.000 of the baseball data frame. If you're working with a pandas data frame, you want to do this when you make the train test split, because otherwise you're going to technically be working on the rows of the original data frame, and if you make any changes to it you'd 01:12:31.000 --> 01:12:38.000 also be changing the original data frame.
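(A minimal sketch of that first split. The data frame name baseball matches what's described above, but the split size and random state are my own picks since they weren't read out.)

    from sklearn.model_selection import train_test_split

    # .copy() hands train_test_split an actual copy of the data frame rather than a view of its rows
    bb_train, bb_test = train_test_split(baseball.copy(),
                                         test_size=0.2,      # assumed split size
                                         shuffle=True,
                                         random_state=216)   # assumed seed, just for reproducibility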
So the .copy() ensures that you're getting an actual copy of the data frame, 01:12:38.000 --> 01:12:41.000 instead of just a view of a subset of the rows of the original data frame. 01:12:41.000 --> 01:12:48.000 So this is a Python thing; it's just the way that they chose to store data in your computer. 01:12:48.000 --> 01:12:54.000 Okay, so now that I have my train test split, I will do some exploratory data analysis, and you'll get some practice with this in tomorrow's problem session. 01:12:54.000 --> 01:13:09.000 So the first thing you wanna check: one of the biggest assumptions in linear regression is that there is a linear relationship between your output, which for us is the wins, the W's, 01:13:09.000 --> 01:13:10.000 and your inputs, which for us are runs or runs allowed. 01:13:10.000 --> 01:13:23.000 So for this I'm going to make scatter plots of my training set's wins against runs and runs allowed. 01:13:23.000 --> 01:13:29.000 Okay. So on the left-hand side plot I've got wins on my vertical axis and runs scored on my horizontal, and then on my right-hand plot 01:13:29.000 --> 01:13:41.000 I've got wins on the vertical and runs allowed on the horizontal. And I would say, based on these 2 plots, 01:13:41.000 --> 01:13:53.000 these look like linear relationships to me. So simple linear regression would be an appropriate model for either of these potential relationships. 01:13:53.000 --> 01:13:57.000 So that's why we're okay to use simple linear regression. 01:13:57.000 --> 01:14:08.000 If we were to look at this and it didn't look like a linear relationship at all, we might want to try something else, and we'll talk about that in the coming lecture notebooks. 01:14:08.000 --> 01:14:09.000 So then we're gonna choose some candidate models to try. 01:14:09.000 --> 01:14:21.000 The first model will just be regressing wins on runs, and then the second model is going to be regressing wins on runs allowed. 01:14:21.000 --> 01:14:26.000 An important step in any predictive modeling project is to have what's known as a baseline model. 01:14:26.000 --> 01:14:33.000 These are models that aren't necessarily good, but they give us some sort of context to see how good our models are in general. 01:14:33.000 --> 01:14:38.000 It's hard to tell whether or not a model is good on its own. 01:14:38.000 --> 01:14:40.000 So let's say we go through this process and it turns out our model has a mean squared error of 100. 01:14:40.000 --> 01:14:59.000 It's hard to tell in the abstract if that is good. We want to have a very simple model to compare our more complicated models with, to say: okay, these models are outperforming the simpler one. 01:14:59.000 --> 01:15:06.000 And, for instance, let's say our baseline model, the very simple one that we're going to compare to, 01:15:06.000 --> 01:15:08.000 let's say it had an MSE of 1,000. 01:15:08.000 --> 01:15:12.000 In this case, if we were able to find a model that had an MSE 01:15:12.000 --> 01:15:20.000 of 100, this model is a good improvement over the baseline. But, for instance, if our baseline had an MSE 01:15:20.000 --> 01:15:25.000 of 10, our more complicated model would be underperforming the baseline. 01:15:25.000 --> 01:15:26.000 So it has a worse error, and it's more complicated.
01:15:26.000 --> 01:15:33.000 So we shouldn't stick with that one, even if it is the best one among the ones we've tried. 01:15:33.000 --> 01:15:40.000 So basically, that's the idea of having a baseline model: you just want to do a sanity check of, is this any better than the baseline? That's the whole goal. 01:15:40.000 --> 01:15:50.000 A really good baseline to start out with, when you're just starting a regression problem fresh, is to just say that there is no relationship: basically 01:15:50.000 --> 01:16:04.000 saying that the number of wins is independent of the runs or the runs allowed, so the number of wins is the expected value plus some random noise. 01:16:04.000 --> 01:16:05.000 To do this, we would estimate it with just the average, 01:16:05.000 --> 01:16:11.000 the arithmetic mean. Okay? So those are the 3 models: 01:16:11.000 --> 01:16:19.000 model 0, our baseline, just taking the average and always predicting that; model 01:16:19.000 --> 01:16:27.000 1, regressing wins onto runs; and then model 2, regressing wins onto runs allowed. 01:16:27.000 --> 01:16:39.000 So to see which of these 3 models performs best, and then see if our models 1 and 2 outperform model 0, we're gonna do k-fold cross-validation. 01:16:39.000 --> 01:16:50.000 So here I import KFold, and now I'm gonna make my KFold object with 5 splits; shuffle will be True. 01:16:50.000 --> 01:16:54.000 And then, so you can compare it later on your computer, we'll do a random state, 01:16:54.000 --> 01:17:00.000 and let's do 616. 01:17:00.000 --> 01:17:08.000 Okay, so we want to calculate the mean squared error for all 3 of these models across all of the cross-validation splits. 01:17:08.000 --> 01:17:15.000 Now, I could do this by hand, but scikit-learn has a function that I can use called mean_squared_error. 01:17:15.000 --> 01:17:22.000 It takes in the true values along with the predicted values, and then will output the MSE. 01:17:22.000 --> 01:17:27.000 So I'm gonna import the mean squared error along with my regression model. 01:17:27.000 --> 01:17:30.000 And so now I don't have to calculate it by hand; 01:17:30.000 --> 01:17:36.000 I can just input what my prediction is along with what the actual values are. 01:17:36.000 --> 01:17:45.000 So what I'm gonna do now is create an array of zeros, and this array of zeros is going to keep track of the mean squared error 01:17:45.000 --> 01:17:49.000 on the holdout set from my cross-validation splits. 01:17:49.000 --> 01:17:59.000 The 3 represents the fact that I have 3 models, and the 5 represents that I'm doing 5-fold cross-validation. 01:17:59.000 --> 01:18:06.000 So: for train_index, test_index in kfold 01:18:06.000 --> 01:18:13.000 .split of bb_train. 01:18:13.000 --> 01:18:21.000 What am I gonna do first? I'm gonna get my training set from this particular 01:18:21.000 --> 01:18:27.000 cross-validation split; I should probably actually do this as a .copy() as well. 01:18:27.000 --> 01:18:40.000 So I'm gonna do bb_train.loc[train_index].copy(); I guess if I do that, I don't need this copy. 01:18:40.000 --> 01:18:44.000 Alright, and then I'm gonna get my holdout 01:18:44.000 --> 01:18:51.000 set: bb_train.loc[test_index].copy(). Here 01:18:51.000 --> 01:18:54.000 I'm getting the mean prediction from my baseline.
01:18:54.000 --> 01:18:55.000 So I just take the mean number of wins for the training split from the cross-validation, times 01:18:55.000 --> 01:19:06.000 a vector of ones that's the length of the holdout set. Now I'm gonna go through and make my linear regression model. 01:19:06.000 --> 01:19:16.000 So, LinearRegression, copy_X equals True. Now I'm going to fit it on the training data. 01:19:16.000 --> 01:19:19.000 So .fit, on bb_tt. 01:19:19.000 --> 01:19:30.000 Model 1 was wins on runs, so model 1's fit takes bb_tt.R 01:19:30.000 --> 01:19:42.000 .values.reshape(-1, 1), and then bb_tt.W.values. Alright, so just as a note: 01:19:42.000 --> 01:19:43.000 you can do this without the .values and just use the columns themselves. 01:19:43.000 --> 01:19:53.000 I prefer to do it with the .values, because sklearn, as of a recent update, as of 01:19:53.000 --> 01:20:03.000 maybe last year, if you use the columns themselves, will always want you to provide input with the column names, and if you don't, it will give you a warning. I really dislike seeing the red warning box, and that's why I turn them into numpy 01:20:03.000 --> 01:20:11.000 arrays here. It would work without the .values as well. 01:20:11.000 --> 01:20:17.000 Okay. So now, now that the model's fit, at this step I'm gonna make my prediction. 01:20:17.000 --> 01:20:27.000 So model 1 dot predict on the training set, or sorry, on the holdout set: bb 01:20:27.000 --> 01:20:38.000 _ho.R.values.reshape(-1, 1). 01:20:38.000 --> 01:20:41.000 And then this is doing the same thing, but for model 2, 01:20:41.000 --> 01:20:46.000 so you don't have to watch me type it all again. 01:20:46.000 --> 01:20:47.000 So now, as we're going through the splits, we're getting our training set 01:20:47.000 --> 01:20:54.000 and our holdout set, we fit the baseline model and get the predictions, 01:20:54.000 --> 01:21:02.000 we fit model 1 and get the predictions, we fit model 2 and get the predictions. 01:21:02.000 --> 01:21:10.000 And now we just have to record the mean squared error. For that we do mean_squared_error, 01:21:10.000 --> 01:21:16.000 then we would do bb_ho.W.values, 01:21:16.000 --> 01:21:30.000 so the true values, remember from the documentation, the true values of y, and then the predicted values of y, which I store for this model in a variable called pred 01:21:30.000 --> 01:21:37.000 1, and then this is just helping me keep track of the cross-validation split I'm on. 01:21:37.000 --> 01:21:40.000 Oh no, what did I do? I think I want them to be iloc. 01:21:40.000 --> 01:21:47.000 Yeah, so these should be iloc. That's what I thought. 01:21:47.000 --> 01:21:52.000 So now that I have that, I can go ahead, and what I'm showing here is, for each of my 3 models, 01:21:52.000 --> 01:21:57.000 so the baseline, model 1, and model 2, the black circles represent the cross-validation 01:21:57.000 --> 01:22:08.000 MSE for a single one of the splits, and then the larger red circles represent the mean cross-validation error 01:22:08.000 --> 01:22:19.000 across all 5 splits. So we can see here that the model that performs best from the cross-validation is the one with the lowest MSE, 01:22:19.000 --> 01:22:23.000 which is model 2 here. 01:22:23.000 --> 01:22:30.000 Okay.
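(Putting that loop together in one place, here is a sketch of the cross-validation comparison. The variable names bb_train, bb_tt, bb_ho and the column names W, R, RA follow what was read out, but they, and the loop structure, are my reconstruction rather than the notebook verbatim; the plotting of the black and red circles is omitted.)

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    kfold = KFold(n_splits=5, shuffle=True, random_state=616)

    # rows: the 3 models (baseline, wins on runs, wins on runs allowed); columns: the 5 splits
    mses = np.zeros((3, 5))

    for i, (train_index, test_index) in enumerate(kfold.split(bb_train)):
        bb_tt = bb_train.iloc[train_index].copy()   # training portion of this split
        bb_ho = bb_train.iloc[test_index].copy()    # holdout portion of this split

        # model 0: the baseline, always predicting the mean number of wins from the training portion
        pred0 = bb_tt.W.mean() * np.ones(len(bb_ho))

        # model 1: regress wins on runs scored
        model_1 = LinearRegression(copy_X=True)
        model_1.fit(bb_tt.R.values.reshape(-1, 1), bb_tt.W.values)
        pred1 = model_1.predict(bb_ho.R.values.reshape(-1, 1))

        # model 2: regress wins on runs allowed
        model_2 = LinearRegression(copy_X=True)
        model_2.fit(bb_tt.RA.values.reshape(-1, 1), bb_tt.W.values)
        pred2 = model_2.predict(bb_ho.RA.values.reshape(-1, 1))

        # record the holdout MSE for each model on this split
        mses[0, i] = mean_squared_error(bb_ho.W.values, pred0)
        mses[1, i] = mean_squared_error(bb_ho.W.values, pred1)
        mses[2, i] = mean_squared_error(bb_ho.W.values, pred2)

    print(mses.mean(axis=1))   # average cross-validation MSE for each of the 3 models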
And so then, in the real world you'd probably try more than 2 models; in this world 01:22:30.000 --> 01:22:33.000 there's nothing else that we can try, based on what we know. 01:22:33.000 --> 01:22:42.000 So in the real world you would do some additional modeling, like trying different models and comparing them to model 2, 01:22:42.000 --> 01:22:43.000 but we're done with that for the sake of this notebook. 01:22:43.000 --> 01:22:46.000 So let's imagine we've done that, and it turns out model 2 is still our best choice. 01:22:46.000 --> 01:22:51.000 So we're going to select model 2. What you then do is sort of a test set 01:22:51.000 --> 01:23:03.000 sanity check. You're gonna take your model, refit it on the entire training set, 01:23:03.000 --> 01:23:08.000 and then calculate the training set MSE along with the test set MSE, 01:23:08.000 --> 01:23:20.000 which is what I'm doing here. And you can see that, unexpectedly (this doesn't usually happen), the training set has a worse MSE 01:23:20.000 --> 01:23:22.000 than the test set. That doesn't usually happen, but it can, because it's just sort of random, right? 01:23:22.000 --> 01:23:29.000 But they are comparable; they're not vastly different from one another. 01:23:29.000 --> 01:23:30.000 And so here I would say: okay, our model isn't doing something unexpected, 01:23:30.000 --> 01:23:36.000 and we're also clearly not having an error with the way that our model was fit, 01:23:36.000 --> 01:23:41.000 so we're okay to, you know, take this and put it into production 01:23:41.000 --> 01:23:48.000 if we wanted to. So that's the idea. 01:23:48.000 --> 01:23:53.000 And then, if we were to find an error here, let's say this test set MSE 01:23:53.000 --> 01:23:59.000 was much larger than the training set's, or much, much smaller than the training set's, 01:23:59.000 --> 01:24:04.000 then we would want to look at the code that we used to fit the model, as well as maybe the actual data, to see: 01:24:04.000 --> 01:24:11.000 okay, did the test set have some weird outliers or something? 01:24:11.000 --> 01:24:12.000 That's the point of the test set: it's the sanity check. Okay? 01:24:12.000 --> 01:24:13.000 So that's the whole process. With the last 3 minutes, 01:24:13.000 --> 01:24:17.000 now is a great time for questions; that's 01:24:17.000 --> 01:24:28.000 today's lecture. So if there are any questions, now's a great time to ask.
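(A minimal sketch of that final sanity check, refitting the chosen model on the full training set and comparing training and test MSEs; as before, the column names are assumptions.)

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # refit the chosen model (wins regressed on runs allowed) on the entire training set
    final_model = LinearRegression(copy_X=True)
    final_model.fit(bb_train.RA.values.reshape(-1, 1), bb_train.W.values)

    train_mse = mean_squared_error(bb_train.W.values,
                                   final_model.predict(bb_train.RA.values.reshape(-1, 1)))
    test_mse = mean_squared_error(bb_test.W.values,
                                  final_model.predict(bb_test.RA.values.reshape(-1, 1)))

    print(train_mse, test_mse)   # these should be comparable; a large gap is a red flag worth investigating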
01:24:28.000 --> 01:24:30.000 Hello! 01:24:30.000 --> 01:24:31.000 Hi! 01:24:31.000 --> 01:24:41.000 How do people go about setting the baseline? Is there any criteria to determine the shape or formula for the baseline? 01:24:41.000 --> 01:24:48.000 So with regression problems, it's typical to start out with just the one that we did, where you'll choose the expected value of your output. 01:24:48.000 --> 01:24:59.000 Then, once you've gone through and done this and found, say, model 2, 01:24:59.000 --> 01:25:06.000 now we want to update our baseline to just be the simple linear regression model, because that's a simple model: 01:25:06.000 --> 01:25:07.000 it can be fit very quickly, and it gives us a reasonable performance that we can compare to. 01:25:07.000 --> 01:25:10.000 So that's what we might do. To start out, you 01:25:10.000 --> 01:25:23.000 typically will assume there's no relationship, and then you might update your baseline as you continue and identify better models. 01:25:23.000 --> 01:25:35.000 Is it the simplest model you'd use when setting the baseline? 01:25:35.000 --> 01:25:36.000 Hmm! 01:25:36.000 --> 01:25:43.000 Yeah, so the idea behind a baseline is you typically do want to have a simple model to compare to. Here the simplest model is just assuming there's no relationship, 01:25:43.000 --> 01:25:44.000 but in practice that might not be a great comparison. 01:25:44.000 --> 01:25:51.000 So you might then say: okay, simple linear regression is 01:25:51.000 --> 01:26:00.000 usually a pretty simple model, so you compare to that. And if a model can't outperform linear regression, then you just might use linear regression. 01:26:00.000 --> 01:26:02.000 Got it. Thanks! 01:26:02.000 --> 01:26:05.000 Yup. I see that we have a question from Steven. 01:26:05.000 --> 01:26:06.000 We have 3 variables: W, R, and RA. We have modeled functions regressing W on R 01:26:06.000 --> 01:26:16.000 and W on RA. If we also modeled runs with runs allowed with linear regression, 01:26:16.000 --> 01:26:24.000 would the resulting triangle of functions commute? In other words, would R predicting RA predicting 01:26:24.000 --> 01:26:29.000 W be the same as R predicting W? Sorry for the naive question. 01:26:29.000 --> 01:26:30.000 Oh no, Stephen, no need to apologize at all. 01:26:30.000 --> 01:26:37.000 That's what these questions are for, you know, asking the questions and getting the answer. 01:26:37.000 --> 01:26:42.000 So in this case, thinking causally, well, I don't know, 01:26:42.000 --> 01:26:55.000 I think it would be a little bit difficult to make an argument that your runs cause your runs allowed, because in baseball offensive performance is typically somewhat independent of defensive performance. 01:26:55.000 --> 01:26:56.000 Now, I think you could probably make arguments that it's a little more subtle than that. 01:26:56.000 --> 01:27:04.000 But I think you wouldn't typically do something where you take one of your features and use one feature to predict another feature, 01:27:04.000 --> 01:27:13.000 then use those predicted features to predict the output. You're adding a layer of complexity 01:27:13.000 --> 01:27:19.000 there. Now, that being said, there are some cases where you would do something like that, called imputation. 01:27:19.000 --> 01:27:35.000 I don't know if we'll talk about that in the live lecture, but you can see an example of imputation in the pre-recorded videos, where, if you had a missing value for a particular feature, you might use the other features to fill in that missing value, to then be used 01:27:35.000 --> 01:27:36.000 for the predictive model. It's a slightly different situation. 01:27:36.000 --> 01:27:46.000 But in general you would not use an entire column of predicted features to then predict the output. 01:27:46.000 --> 01:28:06.000 You'll stick with the ones that you've been given, because, going back to that earlier question, we're assuming that these features are set in stone, and then using them to predict the output. 01:28:06.000 --> 01:28:07.000 Yeah. 01:28:07.000 --> 01:28:11.000 I have a question. When we observe that both R
01:28:11.000 --> 01:28:20.000 and RA have linear correlation with the output, and then we decided to use these to be the features: 01:28:20.000 --> 01:28:26.000 since both of them have linear correlation with the output, can we say that any linear combination of R 01:28:26.000 --> 01:28:30.000 and RA will have linear correlation with the output? 01:28:30.000 --> 01:28:36.000 And is there a way to find the best linear combination of R and RA, 01:28:36.000 --> 01:28:42.000 such that the model will be the best in terms of the lowest MSE? 01:28:42.000 --> 01:28:48.000 Yeah, so in this particular notebook we're self-imposing 01:28:48.000 --> 01:28:51.000 the restriction that we only know simple linear regression, because that's all we've covered so far. 01:28:51.000 --> 01:28:58.000 But in the next notebook, tomorrow, we'll learn about something called multiple linear regression. 01:28:58.000 --> 01:29:04.000 And so what you would do in practice is compare this to the model regressing W 01:29:04.000 --> 01:29:05.000 on both R and RA. In that case the actual model will find the best, meaning 01:29:05.000 --> 01:29:24.000 the lowest MSE it can find, coefficients for R and RA. It's not something where we would systematically go through and test different coefficients by hand and see what's best; 01:29:24.000 --> 01:29:31.000 we let the model do that for us, you know, algorithmically. 01:29:31.000 --> 01:29:32.000 Yeah. 01:29:32.000 --> 01:29:33.000 Thanks. 01:29:33.000 --> 01:29:34.000 Yeah. 01:29:34.000 --> 01:29:40.000 I just had a question, kind of going back to your validation lecture. 01:29:40.000 --> 01:29:50.000 I was just curious: is the point of validation to try to give you a better estimate of the error? 01:29:50.000 --> 01:29:52.000 Is that what it's designed for? What's the output? 01:29:52.000 --> 01:29:56.000 I didn't quite follow the output of that exercise. 01:29:56.000 --> 01:30:00.000 Yeah, so let's use this picture as an example. 01:30:00.000 --> 01:30:06.000 These little black dots are the errors on the holdout sets from the 01:30:06.000 --> 01:30:08.000 cross-validation. So the idea here is that a validation set will give you an estimate, like one estimate. 01:30:08.000 --> 01:30:22.000 And so what you could ultimately end up doing, if you use a validation set approach, is just build the model that performs best on that single validation set. 01:30:22.000 --> 01:30:32.000 So the idea with cross-validation, which is why it's generally preferred, is we're now getting to see the performance of these 3 different models on, 01:30:32.000 --> 01:30:37.000 in this case, 5 different validation sets. Granted, you're using different training sets as well, 01:30:37.000 --> 01:30:39.000 so it's a little bit, you know, 01:30:39.000 --> 01:30:42.000 you kind of have to squint a little bit. 01:30:42.000 --> 01:30:54.000 But the idea being: now, in addition to seeing, okay, on average model 1 does better than the baseline and model 2 does better than model 1, 01:30:54.000 --> 01:30:55.000 you can also get a sense, if you look at the different splits, of, okay,
01:30:55.000 --> 01:31:09.000 in almost all of the splits model 2 did better; it wasn't the case that a model had one split it performed really poorly on but did better on the rest. 01:31:09.000 --> 01:31:26.000 So I guess the short answer to your question is yes, the cross-validation gives you a better estimate of the generalization error, in that it's estimating sort of what the average generalization error would be, as opposed to just a single generalization error. 01:31:26.000 --> 01:31:29.000 Okay, I see. Thanks. 01:31:29.000 --> 01:31:34.000 Yes, and then there's a question: where can I find the recorded lectures? 01:31:34.000 --> 01:31:39.000 Yes, so those are on the webinar page. You go down to the program content, 01:31:39.000 --> 01:31:40.000 and there are videos. You can go through the process of scrolling all the way down to the bottom, 01:31:40.000 --> 01:31:47.000 but Roman has graciously provided a filter button. 01:31:47.000 --> 01:31:59.000 So you click on that filter and then search for the label, like May live lectures or something like that, and you can find it there. 01:31:59.000 --> 01:32:00.000 The next question is asking: is only the mean value from each model compared, or is the MSE 01:32:00.000 --> 01:32:11.000 computed for each model? So each of these black dots is the MSE 01:32:11.000 --> 01:32:17.000 on a particular holdout split, going back to that picture of the left-out set; 01:32:17.000 --> 01:32:23.000 that's what each of these dots represents. So you could, if you wanted to, try to get a sense by comparing all the splits to one another, 01:32:23.000 --> 01:32:42.000 but what's generally done in practice, because you probably won't have the time to go through and do that, is you'll just compare the average of these holdout values to one another. 01:32:42.000 --> 01:32:52.000 Sure! 01:32:52.000 --> 01:32:53.000 So, 01:32:53.000 --> 01:32:54.000 yeah, so can I clarify my question: so it's not just one average value that's computed; there's one MSE value from each holdout set? 01:32:54.000 --> 01:32:59.000 Yeah, so for the splits, there are 5 01:32:59.000 --> 01:33:12.000 holdout sets. We calculate the error on those 5, average them together, and that gave us the red dot for each of the models. You then compare that average across models to figure out which one performed best on average. 01:33:12.000 --> 01:33:22.000 Okay, so I know the mean squared error has a divide by n, 01:33:22.000 --> 01:33:27.000 so in this case should the n just be one? 01:33:27.000 --> 01:33:38.000 No. So the n in the formula for mean squared error, 01:33:38.000 --> 01:33:48.000 this n refers to the number of observations in the training set, or in whatever set you're looking at. 01:33:48.000 --> 01:33:49.000 Okay. 01:33:49.000 --> 01:33:50.000 So if it was on the training set, you know, we typically call that n, 01:33:50.000 --> 01:33:58.000 and if it's on the test set or holdout set, it's whatever the size of that set is, whatever the number of observations is. That's where the n comes from. 01:33:58.000 --> 01:34:01.000 Okay. Okay, I see. Okay, gotcha. Okay, thank you. 01:34:01.000 --> 01:34:03.000 Yup! 01:34:03.000 --> 01:34:19.000 Alright, maybe one more question, and then we'll sign off for today. 01:34:19.000 --> 01:34:21.000 Okay. There don't seem to be any more questions.
01:34:21.000 --> 01:34:26.000 Thank you so much to everyone that stuck around. That was day number 3. 01:34:26.000 --> 01:34:29.000 I'll upload the video later tonight, and you'll be able to find it if you'd like to rewatch it later. 01:34:29.000 --> 01:34:36.000 Okay.