Alright, I'm going to start recording now. Hi, everybody, welcome back. This is lecture day number 4, so today we're going to continue learning about regression. We'll hopefully get through four or so notebooks, and then that will be it. But before we get started, does anyone have questions about anything?

Okay. I'm going to go ahead and open the notebook and get my chat window open, and we'll continue on with what we were learning about regression. Yesterday we ended with a first predictive modeling project, but in terms of regression we had only learned simple linear regression. We also talked about things like data splits and the whole idea of supervised learning and predictive modeling. So today we're going to continue on with regression, and our first step is to learn about multiple linear regression.

Multiple linear regression models are just regression models that use more than one feature, more than one column of the data set, to predict the output. That's what we're going to learn about in this first notebook. The multiple linear regression model is a quick extension of the simple linear regression model. We assume that we have n observations of some output y, and along with that a set of m features stored in vectors x_1, x_2, all the way up to x_m. Then the multiple linear regression model regressing y on those x_i is given by y equals beta_0 plus beta_1 x_1 plus beta_2 x_2, and so on, up through beta_m x_m, plus the same random noise as before: normally distributed with mean 0 and constant standard deviation. You can rewrite this using linear algebra as a matrix X times a vector beta, where beta is the vector of beta_0 through beta_m, and X is a matrix whose first column is all ones and whose remaining columns are the feature vectors x_1, x_2, up to x_m. So it really is just a quick extension of the simple linear regression model.

Before we talk about how to fit this with scikit-learn, we're going to talk about how you fit it in theory. If you're not someone who's interested in the mathematics behind fitting a model, that's perfectly fine; you can tune back in when we get back to the scikit-learn side of things. The idea of fitting this model is exactly the same as fitting the simple linear regression model: we want to minimize a loss function, and in regression problems the most common loss function is mean squared error. Remember, this is the sum from i equals 1 to n of the actual observation minus the predicted observation, squared, added up and divided by n. The mean part is the dividing by n, the sum part is the adding up, the square part is the square on the difference, and the error part is the actual minus the predicted.
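To collect what was just said in one place, the model and the mean squared error can be written out as follows (this is only a restatement of the formulas described above):

```latex
% Multiple linear regression with m features and n observations
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \epsilon,
\qquad \epsilon \sim N(0, \sigma^2)

% Matrix form: X carries a leading column of ones, beta = (beta_0, ..., beta_m)
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}

% Mean squared error of a fit with predictions \hat{y}_i
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```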
If I plug in the model, the prediction is X times beta hat, where beta hat is the vector of estimated coefficients; that's how you get this expression. If you then rewrite the mean squared error with some linear algebra, you get the longer matrix expression here, which I'm not going to read out loud. You can use matrix calculus to take the derivative of that expression with respect to the coefficients, set it equal to 0, and you find that the estimates of the betas are given by this formula: X transpose times X, take the inverse of that, times X transpose times y. This estimate of the betas is known as the ordinary least squares estimate (least squares because we're minimizing a sum of squares), and the formula is also sometimes called the normal equation, so this expression for beta hat is known as the normal equation solution for estimating the betas. Let's go show how you would do this in NumPy and then in scikit-learn.
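For reference, the "longer matrix expression" and the resulting estimate look like this; this is a reconstruction of the standard derivation the lecture is describing:

```latex
% MSE in matrix form, with predictions \hat{y} = X\hat{\beta}
\mathrm{MSE}(\hat{\boldsymbol{\beta}})
  = \frac{1}{n}\left(\mathbf{y} - X\hat{\boldsymbol{\beta}}\right)^{\top}
    \left(\mathbf{y} - X\hat{\boldsymbol{\beta}}\right)

% Setting the gradient with respect to \hat{\beta} equal to zero gives the normal equation
X^{\top}X\,\hat{\boldsymbol{\beta}} = X^{\top}\mathbf{y}
\quad\Longrightarrow\quad
\hat{\boldsymbol{\beta}} = \left(X^{\top}X\right)^{-1} X^{\top}\mathbf{y}
```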
Is there a question? Okay. So we're going to use that baseball data again. Just as a reminder, here's what that data looks like: we've got teams, years, leagues, games played, wins, losses, runs, and runs allowed.

Aziz is asking why X transpose X would be invertible. So it maybe isn't invertible, but in most real-world applications of the model you're going to have an invertible matrix. It's very common for it to be invertible, but it doesn't necessarily have to be.

Okay, so let's make our train-test split. We did that, we saw it in an earlier notebook, and then we're going to fit the model. Yes, a question?

Student: Why do you use .copy() here? I haven't seen that in earlier places, not in this course or in other work. What does it do?

Instructor: If you were to put baseball directly in here, the train-test split would technically give you references to the original rows. If I didn't have the copy here, bb_train and bb_test would end up pointing back to the rows of the original data frame, so if I did something to the rows of bb_train or bb_test, I could accidentally be altering the original data frame. The way Python stores things in memory, you can imagine that when you define an object, Python puts that object in a hypothetical box in your computer, and the variable is just a name pointing to that box. So if I just split baseball itself into bb_train and bb_test, those names would just be pointing to the relevant rows of the original data frame; it's not a unique copy. In order to make sure you have a unique copy, you have to make an actual copy of the object, so you do baseball.copy(), which forces a unique new copy of the data frame to be made, and those copies are then what bb_train and bb_test point to.

Student: Sorry, I think train_test_split probably works on a copy instead of the original when it splits out bb_train and bb_test. So my confusion is: is it still necessary?

Instructor: In my experience it is necessary. It's possible scikit-learn has updated the function, but in the past, when I've done it without the copy and then tried to do something on the train set or the test set, it would give me a warning that I was basically changing the original data frame; I forget the exact wording of the warning. So in my past experience it has not been making a hard copy in the background, and that's why I do it.

Okay. So I'm regressing W on R and RA. How do I do that? First I'm going to put it into the form we talked about: I'm going to build a matrix X whose first column is a column of ones, and whose other two columns are my R and my RA. So I'm making a matrix X_train, setting it to be a matrix of all ones, three columns of ones, and then the other two columns, columns one and two, get overwritten with the R and RA values from my training set. And y_train is going to be the vector of W values from my training set. We can look at them if we'd like: here is X_train, where you can see the column of ones, and here is y_train.
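As a rough sketch of what those notebook cells are doing (the variable names, the column names W, R, and RA, and the split parameters are assumptions based on the description above, not the notebook's exact code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Split a copy so that bb_train / bb_test do not reference the original rows
# (test_size and random_state are placeholders, not the notebook's values)
bb_train, bb_test = train_test_split(baseball.copy(), test_size=0.2,
                                     random_state=216, shuffle=True)

# Design matrix: a leading column of ones, then the R and RA features
X_train = np.ones((len(bb_train), 3))
X_train[:, 1] = bb_train['R'].values
X_train[:, 2] = bb_train['RA'].values

# Target vector: the number of wins
y_train = bb_train['W'].values
```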
Zack is asking why we need the column of ones. If we go back to our formula, notice there's a beta_0 out front; in order to be able to write the model as X times beta, X has to have a column of ones out front and beta's first entry has to be beta_0.

Student: In that case, should fit_intercept be turned off?

Instructor: We're first doing it with NumPy, and then we'll talk about how to do it with scikit-learn, and that will come up there.

Okay, so how do we do this with NumPy? We just use the normal equation. Let's refer back: it's X transpose X, inverted, times X transpose times y. So we have X.transpose() dotted with X, and around that we want np.linalg.inv, which stands for inverse, and then that gets dotted with X.transpose() again, and then that is multiplied with y. And these should all be X_train and y_train, not X and y.

So now we can look at our estimates from the normal equation. We estimate an intercept of 84.08, a beta_1 hat of about 0.1 if we round, and a beta_2 hat of about negative 0.1 if we round. To make the predictions on the training set, the fitted values, we would just do beta hat at index 0, plus beta hat at index 1 times column one of X_train, plus beta hat at index 2 times column two of X_train, and that lets us calculate the MSE on the training set, which is 16.95.

Most of the time you're not going to do this; you're just going to use scikit-learn. But I think it's good to see the implementation of the normal equation by hand, with NumPy doing all the linear algebra for us. From sklearn.linear_model we import LinearRegression, so we're still using the LinearRegression object; in scikit-learn, simple linear regression and multiple linear regression use the same object. This was Zack's question earlier: because X_train has a column of ones in it, when we define our LinearRegression object we're going to set fit_intercept=False. Why do we want to do this? The default, fit_intercept=True, assumes that the columns of X only contain features, but because we have a column of ones out front, there to let us demonstrate the normal equation, we have to say fit_intercept=False; the intercept in this particular case is absorbed into the coefficients. Then we do reg.fit(X_train, y_train). Another thing you might notice: remember that in simple linear regression we had to do a reshape. Here we don't have to reshape, because X is a 2D array.
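A sketch of both fits, assuming the X_train and y_train built above; the exact notebook cells may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Normal equation: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X_train.transpose().dot(X_train)).dot(X_train.transpose()).dot(y_train)

# Fitted values on the training set and the training MSE
y_fit = beta_hat[0] + beta_hat[1] * X_train[:, 1] + beta_hat[2] * X_train[:, 2]
print(mean_squared_error(y_train, y_fit))

# Same model in scikit-learn; fit_intercept=False because X_train
# already carries the column of ones
reg = LinearRegression(fit_intercept=False)
reg.fit(X_train, y_train)
print(reg.coef_)  # should match beta_hat
print(mean_squared_error(y_train, reg.predict(X_train)))
```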
X_train is a 2D array, a matrix, whereas before, with a single feature, it was a one-dimensional array; because it's already a 2D array, we don't need to reshape. Now that we've fit our linear regression, we can look at the coefficients and compare. We've got beta_0 hat equal to 84 from the normal equation, and 84 is the first coefficient in this array; then 0.097 and 0.097; and finally negative 0.101 and negative 0.101. If we wanted to make a prediction in scikit-learn, we'd use the model's predict method on X_train. We're just using the training set in this notebook; we're not going to look at the test set at all. And the MSEs also match.

Hopefully this wasn't too confusing, adding the column of ones so we could see how it relates to the normal equation setup, and likewise the fit_intercept argument. That's it for multiple linear regression using continuous features; the next couple of notebooks will expand this model to include categorical features. But before we do any of that, are there any questions about what we learned in this notebook?

Awesome. Okay, so that's multiple linear regression where all of your features are continuous. Now, it's totally possible that we want features that aren't continuous, that is, categorical features. How do we deal with that? We have to learn how to add categorical variables, and also interactions, to our models, and to do that we're going to use a new data set about beer.

Once my notebook is ready to go, we'll see what this data set looks like. I'm making my train-test split from the very beginning. You might notice something new called stratify; ignore it for now, we're going to come back to it next week. It just stratifies the split; we'll talk about it next week. In this data set, each row represents a different beer that exists in the world, or at least did at one point. It has the IBU, which stands for International Bitterness Units; the ABV, which stands for alcohol by volume; the rating from the website the data came from, where users could input their rating out of 5 stars for different beers; and the type of beer, which is either an IPA or a stout. We want to build a model: maybe we have noticed that the more alcohol there is in a beer, the more bitter it tastes.
So we want to look at building a model that predicts IBU using ABV, and here's what that data looks like. I've got IBU on the vertical axis, because it's what I'm going to try to predict, and ABV on the horizontal axis, because that's the feature. Right now I'd say this looks like it has a linear relationship; it's not the strongest linear relationship in the world, but it looks to me like there is one. So, thinking of this from a predictive modeling standpoint, our baseline model might be that IBU is independent of ABV, just the expected value of IBU plus random noise. The next model we might be interested in is the simple linear regression model we talked about yesterday, where I just regress IBU onto ABV.

Now let's change this plot so that we also include information about beer type. The two types of beers in this data set are stouts and IPAs, and we're trying to see whether beer type has an impact on IBU. One way to do this is something called a swarm plot. Here is a swarm plot of the IBU for the two different beer types, stouts on the left and IPAs on the right. In a swarm plot, each observation in the data set is represented by a point; this point here is one of the stouts, this point here is one of the IPAs, and it's just a way to visualize the distribution of the variable you're interested in. What you're looking for, when checking whether a categorical variable might have an impact on the thing you're interested in, is whether the distributions tend to overlap a lot. If they overlap quite a bit, it's possible the variable doesn't have an effect; if they're offset from one another, that indicates it maybe does have an effect and you'd like to keep it in the model. Another way to do this is to recreate the earlier scatter plot but color the points by category. Here's that plot from earlier, but now the IPAs are orange triangles and the stouts are blue circles, and it appears that the IPAs tend to live higher up on the plot than stouts of an equivalent ABV. This suggests to me that we might want to try including the beer type category in our model. So how do we do that? We're going to learn how, after I pause for questions.
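For reference, the two plots described here could be produced roughly like this, assuming seaborn and matplotlib and columns named IBU, ABV, and beer_type (the notebook's actual plotting code may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Swarm plot: the distribution of IBU for each beer type
sns.swarmplot(data=beer_train, x='beer_type', y='IBU')
plt.show()

# Scatter plot of IBU against ABV, colored and styled by beer type
sns.scatterplot(data=beer_train, x='ABV', y='IBU', hue='beer_type', style='beer_type')
plt.show()
```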
Student: Hi, just a general question. When you're building a model, is it important to check beforehand whether the relationships are linear?

Instructor: Yeah. If you're building a linear regression model, one of the key assumptions is that there is a linear relationship. So if the thing you're trying to predict does not have a linear relationship with the features you're using to predict it, then the linear regression model is probably not going to be a very good model. That's one of the things you'd want to check if you're considering using a linear regression model.

Student: So, from today's practice for example, we were asked to calculate the Pearson coefficients and get the correlation. Is there a particular value, say if it's below 0.5, where you don't bother with linear regression and do something else? Some threshold where you can toggle?

Instructor: I don't know of a general rule of thumb. In my head I usually use 0.3, or maybe 0.2: if the correlation is bigger than that in magnitude, so either below negative 0.3 or above positive 0.3, I'll consider including the feature, because it might help. If it's above roughly 0.5 in magnitude, negative or positive, I think, okay, this is a relatively strong correlation, so we should probably include it. And if it's bigger than, let's say, 0.7, that's a really strong correlation and we should definitely try to include it. That's the thought process I go through; I don't know that there's a general rule of thumb for every situation.
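For concreteness, the kind of correlation check being discussed can be done directly in pandas; the column names below are the ones from this notebook's beer example:

```python
# Pearson correlation between the feature and the target
print(beer_train['ABV'].corr(beer_train['IBU']))

# Or the full correlation matrix over the numeric columns
print(beer_train.corr(numeric_only=True))
```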
Student: Just a quick question related to that. You're going to want to use all of the features that you have, right, as long as you think they're contributing to the result? But if one of them isn't linearly related to what you're trying to predict, does that just throw everything off? Would you just not include it, or would you use some kind of log transform to try to make it linear? It seems like there's always going to be one feature that's not linear with respect to what you're trying to predict, so how do you deal with those situations?

Instructor: That's a good question. Typically, you're not going to use every single feature that you have. We collect a lot of data in the world, and only a little bit of it is relevant for what we want to do. There's a process called feature selection, and we're going to learn a couple of different approaches to it. With the tool set we've built up so far, those approaches are really limited to exploratory data analysis. With that, you might see something where it doesn't appear that there's a linear relationship, but, like you said, and as we'll learn in the next notebook, just because it doesn't appear that there's a linear relationship doesn't mean that there's no relationship. For instance, maybe a logarithmic relationship is appropriate, or maybe taking the square or the square root is appropriate. That's really hard to tell without making these types of plots and looking visually, because if you only look at correlations, something might have zero correlation but still have a relationship.

Student: So is it not best practice, then, to use as much data as you have, as long as it's relevant to the situation?

Instructor: That's the key caveat: if the data you're using is improving the predictive power of your model, then you want to use it. But throwing all the features into your model isn't always going to give you the best model. Sometimes there are features that are just not at all related to the thing you're interested in, and those aren't going to be included in the model. For example, maybe somehow in this data set you were able to get information on the color of the beer label; that's not at all going to be related to the IBU, so we wouldn't want to include it, even though it's data we might have.

Student: So is that where you just use the correlation matrix and pick out the ones that are above a certain threshold value? Or does that only tell you whether they're linearly related?

Instructor: Like I said earlier, correlation isn't the end-all-be-all. You have to do things like exploratory data analysis, look to see if there are other relationships, and try other approaches that we're going to learn in later lectures.

Student: Okay, I'll look forward to those then. Sorry, it's kind of a big, broad question.

Another participant: Matthew, can I make a comment? Jacob, I think if you restrict yourself only to linear regressions, then you will keep having all these questions. I think Matthew was also saying that there are other approaches, other models, that you can adopt where all these nonlinear relationships can be taken care of. So these questions will become relevant at a later point, when we learn more complicated and more complex types of models.
Student: So the answer is basically: if you have some features you want to use that are not linearly related to the thing you're predicting, but you want to use them because they are related in some way, you just wouldn't use a linear model. Or you would use logs or whatever to turn the relationship into a linear one. I just wouldn't want to throw a feature out in order to use a linear model.

Instructor: Right. If you're seeing through your exploratory data analysis that there is a relationship, it's just not linear, then you would want to change your model. But if you do those explorations and it doesn't appear there's any relationship, then you would not use that feature.

Student: Alright, thank you.

Instructor: And Melanie was asking to discuss again what we're checking for in these plots. I think part of it is that I didn't explain it very well, because I was remembering what I wrote earlier in the month. Basically, I'm trying to demonstrate the process of investigating whether or not it's worthwhile to include a categorical variable in a linear regression model. One way to do that is to examine the distribution of the thing we're trying to predict, which for us is IBU, and see whether it's impacted in any way by the values of the categorical variable, which for us is stout or IPA. Here, because these two distributions appear to be slightly offset, with the mean or median appearing to be here for IPAs and maybe over here for stouts, that suggests we might consider including it. Another plot type you could make, which is more popular in some circles, is a box plot instead of a swarm plot; this is why you should never just freewheel it, but let's try it and see if that works. There you go. This is a box plot showing the interquartile range, and I think I go over this in the problem session you'll work on Monday. The interquartile ranges are almost not overlapping at all, which is also a suggestion that you'd want to include the variable. Basically, you're looking for evidence that this categorical variable does seem to be impacting the output variable you want to predict, and if that's the case then you want to consider using it. That's one way. The other way is basically remaking the scatter plot and
coloring the markers according to the values of the categorical variable. There you're seeing that, for equivalent values of ABV, the IPAs tend to live above the stouts, which suggests that the type of beer does have an impact on the IBU.

Okay, so now that we're happy and we want to include beer type, we have to go through the logistical process of how you actually include categorical variables in a linear regression model. Some of you may be familiar with this from statistics coursework, and some of you may not. To include a categorical variable in a model, you first have to do some data pre-processing. Categorical variables are typically stored either as strings or as indicator numbers; they'll be numbers, but the numbers don't actually measure anything, they're just indicators. Strings are great for human readability, since I can look at a value and say, okay, this is an IPA, this is a stout, but they're really bad for regression models. For regression models with categorical variables, you need to do something called one-hot encoding.

One-hot encoding is where you take a categorical variable and represent it as a series of zeros and ones, depending on the number of categories. For us, we have two categories, so we're just going to need one 0/1 variable; in general, if you have k unique categories, you need to create k minus 1 one-hot encoded variables, also known as indicator variables. What is an indicator, or one-hot encoded, variable? We denote it with a 1 with a subscript j, where j is one of the categories. The variable equals 1 if your observation is equal to category j. For instance, 1 sub stout, like I have written down here, would be 1 if the beer is a stout and 0 otherwise. So your indicator variable is 1 if your observation is that particular option for the category, and otherwise it's 0.

How can we do this in Python? There's a function called get_dummies from pandas. You call pd.get_dummies and you input the column you're interested in. For us, because we have two beer types, we only need one indicator variable, and I chose to make that the stout indicator. So this is what... oh no, what did it do? Oh, because there is no column called stout; I'm having a brain fart. Here we go, beer_type. Okay, so get_dummies
takes in your categorical column and produces a new data frame whose columns are zeros and ones, depending on the possible options for the category. The first column is IPA, so you'll have a 0 if the beer is not an IPA and a 1 if it is an IPA. Down here, let's look at the first five rows. We can see the first beer in this training set is a stout, and that's why IPA is 0 and Stout is 1. The second one is an IPA, so IPA is 1 and Stout is 0.

Remember, we only need one indicator, and we're going to choose the stout indicator, because that's what I wrote in the notes, so we just need to add a Stout column to the training set. We'll call pd.get_dummies on the beer_type column of beer_train, and then we just need the column for Stout, so I select the Stout column. Then we can check against the first five rows: a stout, so it's a 1; an IPA, so it's a 0; and if we went through and checked, it would hold true for all of them.

I know the concepts of indicator variables and get_dummies can be confusing if you're seeing them for the first time, so does anyone have questions about get_dummies or about the indicators?

Laura is asking: are dummies only used for categorical data? Yes. Trying to do it for continuous data would be difficult, because you'd usually end up with a lot of columns; in general, continuous data can take on infinitely many possible values. So you use it for categorical data. Sometimes you'll also have what's called ordinal data, which is technically categorical but where the numbers provided have some sort of meaning; in that case you can sometimes use those numbers directly, but other times people suggest you use indicators for it as well. So dummies are, I'm going to say, always used for categorical data.

Student: Can the dummies be non-integer, like maybe 0.1, 0.2, 0.3?

Instructor: For other types of models they could be any two distinct numbers, but for linear regression here they specifically do have to be zeros and ones.

Student: Okay, thanks.
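Putting the encoding step together in one place, a sketch (assuming the categorical column is named beer_type with values IPA and Stout):

```python
import pandas as pd

# One 0/1 column per category ('IPA' and 'Stout'); recent pandas returns
# booleans by default, so .astype(int) gives literal zeros and ones
dummies = pd.get_dummies(beer_train['beer_type']).astype(int)
print(dummies.head())

# Keep only k - 1 = 1 indicator: the Stout column
beer_train['Stout'] = dummies['Stout']
```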
Student: Can I chime in a little bit on that? I'm not sure if this applies, but I was working on a recommendation engine at one point, and I had various categories for the things I was recommending, but within a specific category there was also a ranking. What I ended up doing, which I'm not sure is correct, and I suppose this is my question, is a get_dummies-style categorical encoding, but instead of zeros and ones, an item would have a 0 if it wasn't in that category and then the ranking within that category rather than a 1. Does what I'm saying make sense? Is that valid?

Instructor: You can do something like that. If you're going to keep the rankings, I believe you'd want a different type of model, and maybe you weren't doing a regression model; it's hard to tell without knowing the model. There are certain models that are built for rank regression; I don't know them off the top of my head, I'd have to do a web search. But in general, I believe it's recommended for ranking data that you would still do the one-hot encoding, because there isn't necessarily the same integer meaning behind the rankings. The difference in people's minds between something ranked number 1 and something ranked number 2, compared to the difference between 2 and 3, may not be equal; maybe there's a bigger gap between the ranking from 2 to 3 than from 1 to 2. So in that case I believe it is still recommended, in a regression model, that you use one-hot encoding for those.

Student: Then how would you encode the actual ranking? Would you have a separate feature for it?

Instructor: I'd say this might be a good question to come to office hours for, because it's diving a little too far into specifics.

Student: Okay, yeah, I can do that.

Instructor: And then, sorry if I'm saying your name incorrectly, there's a question: what if you have more than two categories? If you have more than two possible categories, then if you had 3 you would make 2 indicators, and if you had 4 you'd make 3 indicators. If you have k possible categories, you need to make k minus 1 indicators. get_dummies makes k of them, but then you need to select k minus 1 of them. And I saw somebody had their hand up, so if that person still has a question, go ahead.

Student: It wasn't me, but I can ask later.

Instructor: Okay. Alright.
So what's the model that we want to build now? Now that we have this Stout variable, which is the indicator of whether the beer is or is not a stout, we're just going to regress IBU on both ABV and Stout. We don't have to change the model; the object is still the same, and if this were a normal equation situation, that would also still apply. Since we're using scikit-learn, all we have to do is make sure that when we fit our model, we include the Stout column along with the ABV column. So here I define the model and fit it on the training set, and this code chunk, which has a lot of code in it, is just me plotting: it makes the scatter plot we saw before, but now I'm adding the model output for the two different categories. The IPAs are the orange dotted line, the stouts are the solid blue line, and one thing you might notice is that they both have the same slope. From looking at the scatter plot, it's reasonable to say it doesn't look like they should have the same slope; you might suggest that the IPAs look like they should have a steeper slope.

So how do I change that? Let's go back to the model we just fit, the one I'm highlighting with my mouse, and go through what happens when we change from a beer that is an IPA, meaning Stout equals 0, to a beer that is a stout, meaning Stout equals 1. In the case when Stout is 0, the model reduces to beta_0 plus beta_1 ABV. When the beer is a stout, the model becomes beta_0 plus beta_2 plus beta_1 ABV, plus epsilon. You can see that the only thing that changes between the two beer types is the intercept, and that's exactly what we're seeing here: the model is saying, okay, IPAs have a higher intercept than stouts.

If we also want a model where the slope changes, we have to include what's known as an interaction term. The interaction model still has the plus beta_2 Stout that we had before, but now there's also an interaction between ABV and Stout, meaning you multiply your ABV column by your Stout column and include that as well. That's what we mean by an interaction term: it's just the multiplication of two of your features. We can again see what happens when Stout is 0 versus when Stout is 1. When Stout is 0 we're left with beta_0 plus beta_1 ABV, but when Stout is 1 we have beta_0 plus beta_2, plus the quantity beta_1 plus beta_3 times ABV. So this model has the potential to have both a different intercept and a different slope for the stouts.
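Written out, the two models being compared are as follows (a restatement of the equations just described):

```latex
% Indicator only: the two beer types share a slope
\mathrm{IBU} = \beta_0 + \beta_1\,\mathrm{ABV} + \beta_2\,\mathrm{Stout} + \epsilon
\quad\Rightarrow\quad
\begin{cases}
\beta_0 + \beta_1\,\mathrm{ABV} + \epsilon & \text{IPA } (\mathrm{Stout}=0)\\
(\beta_0 + \beta_2) + \beta_1\,\mathrm{ABV} + \epsilon & \text{stout } (\mathrm{Stout}=1)
\end{cases}

% Adding the interaction term lets the slope differ as well
\mathrm{IBU} = \beta_0 + \beta_1\,\mathrm{ABV} + \beta_2\,\mathrm{Stout}
             + \beta_3\,\mathrm{ABV}\cdot\mathrm{Stout} + \epsilon
\quad\Rightarrow\quad
\begin{cases}
\beta_0 + \beta_1\,\mathrm{ABV} + \epsilon & \text{IPA}\\
(\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\mathrm{ABV} + \epsilon & \text{stout}
\end{cases}
```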
So let's go ahead and build this model and then visualize the fit. The first thing we have to do is make the interaction term: beer_train at ABV times beer_train at Stout. Just so everyone's clear, when I say "at" it means I just want that particular column, so "at ABV" means I just want the ABV column. Now I'm going to fit the model I wrote down here, so in my features I've got ABV, Stout, and then the interaction between ABV and Stout. And here I plot the model fit, and you can see that now I have a slightly greater slope for the IPAs than I do for the stouts.
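A rough sketch of those steps, using the column names above; note this assumes the default intercept is used here, since this notebook does not add a column of ones:

```python
from sklearn.linear_model import LinearRegression

# Interaction column: ABV multiplied by the Stout indicator
beer_train['ABV_Stout'] = beer_train['ABV'] * beer_train['Stout']

# Regress IBU on ABV, the Stout indicator, and their interaction
reg_inter = LinearRegression()
reg_inter.fit(beer_train[['ABV', 'Stout', 'ABV_Stout']], beer_train['IBU'])

print(reg_inter.intercept_, reg_inter.coef_)  # beta_0, then (beta_1, beta_2, beta_3)
```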
I see we have a question.

Student: Hi, Matt! Please correct me if I'm wrong, but is the basic idea here that if you have K categories, the point of one-hot encoding is that you basically have K different models, and it fits each model on the data in that category?

Instructor: Yes, that is the basic idea: these indicator variables, like Stout, essentially allow the model to fit two models at the same time. It's fitting the model for IPAs here, and this part would be fitting the model for stouts. That's the whole idea behind indicators. And remember, the reason we're choosing K minus 1 indicators is that there's always going to be a default comparison category when all of the indicators are equal to 0: the IPAs get absorbed into that part of the model, whereas this part is the adjustment made if we have a stout. Are there any other questions about the interaction term?

Student: I have a question. In this particular example there are two types of beers, the IPA and the stout, right? Suppose there was a third kind of beer and I wanted to augment the model to account for it. Am I correct to assert that I would need, say, two more terms: beta_4 times the third beer type, and beta_5 times ABV times the third beer type?

Instructor: Yes. Say the third type was a porter: you'd have beta_4 times Porter plus beta_5 times ABV times Porter. And you can't include just a single indicator; you have to include all the indicators at once.

Student: Okay. So if I have a model that needs to encompass, let's say, N types of beers, how many extra terms would that be? Two times N minus 1 extra terms?

Instructor: Yes, if you're going to include the interactions. And you don't always include interactions; sometimes there is no interaction and the slopes appear to be the same, in which case you just include the indicator.

Student: Gotcha, thanks.

Instructor: Aziz is asking why we chose the interaction to be ABV times Stout, and whether it could be Stout divided by ABV. The reason we chose the multiplication is that the thing we're regressing IBU on is ABV, so the interaction is Stout times ABV. The one you're wondering about, Stout divided by ABV, would be like also trying to regress IBU on one over ABV, and we have to stay consistent with the thing we're actually regressing onto.

Alright, let's now imagine we're back in predictive modeling world, just to get some more practice with things like k-fold cross-validation. We have four models we can review: the baseline, simple linear regression, the one with just Stout as an indicator, and the one with both the indicator and the interaction term. We're going to compare them with cross-validation. I import KFold and mean_squared_error, then I make my KFold split object. Here is an empty array where I'm going to store my MSEs; it's typical to make an array of zeros and then fill in the values as you go. Here I'm looping through the k-fold splits, getting my training set and the holdout set, which other people call the leave-out set. Here I'm getting the baseline fit, the average value of IBU on the training part of the split; here I'm fitting the simple linear regression model and getting its prediction; here I'm getting the model with just Stout; here the model with both Stout and the interaction term; and here I'm recording the mean squared error for all four models.
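A sketch of that cross-validation loop; the number of splits, the random seed, and the variable names are placeholders, and it assumes the Stout and ABV_Stout columns created earlier:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

kfold = KFold(n_splits=5, shuffle=True, random_state=216)
mses = np.zeros((5, 4))  # one row per split, one column per model

i = 0  # split counter, kept separate from the loop variables as in the lecture
for train_index, holdout_index in kfold.split(beer_train):
    bt_tr = beer_train.iloc[train_index]
    bt_ho = beer_train.iloc[holdout_index]

    # Model 0: baseline, predict the mean IBU of the training part of the split
    base_pred = bt_tr['IBU'].mean() * np.ones(len(bt_ho))

    # Model 1: simple linear regression on ABV
    slr = LinearRegression().fit(bt_tr[['ABV']], bt_tr['IBU'])

    # Model 2: ABV plus the Stout indicator
    ind = LinearRegression().fit(bt_tr[['ABV', 'Stout']], bt_tr['IBU'])

    # Model 3: ABV, Stout, and their interaction
    inter = LinearRegression().fit(bt_tr[['ABV', 'Stout', 'ABV_Stout']], bt_tr['IBU'])

    mses[i, 0] = mean_squared_error(bt_ho['IBU'], base_pred)
    mses[i, 1] = mean_squared_error(bt_ho['IBU'], slr.predict(bt_ho[['ABV']]))
    mses[i, 2] = mean_squared_error(bt_ho['IBU'], ind.predict(bt_ho[['ABV', 'Stout']]))
    mses[i, 3] = mean_squared_error(bt_ho['IBU'], inter.predict(bt_ho[['ABV', 'Stout', 'ABV_Stout']]))
    i = i + 1

print(mses.mean(axis=0))  # average cross-validation MSE for each of the four models
```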
When I run this, I can look at the average cross-validation MSE for all four models. It looks like the one with the Stout indicator and the one with the interaction term are the two that perform best, and they're pretty close to one another. But because the interaction model's MSE is slightly lower than the one with just Stout, if I were to stop now, the interaction model is the one I would choose.

Okay, are there any other questions?

Student: Matt, I have one question that's actually about the coding structure, not really about regression. I noticed this yesterday too: in the for loop it says "for train_index, test_index in kfold.split(...)", and nowhere in there is the i, but for some reason the loop is okay with you using i down there in the MSE part.

Instructor: Yeah, so I define i right here, above the loop.

Student: Oh, right, okay. So it's not part of the loop; it's your own counter that you put in.

Instructor: Yup. I do it this way because I think, at least in this case, it makes things clearer for people who are learning for the first time. A lot of other people might do something like enumerate, so you can acquire i as part of the loop as well; you could do something like that, but when I'm teaching I like to keep it separated out.

Student: Totally makes sense, thanks.

Instructor: Are there any other questions?

Student: I have one. The mean squared error here seems to have a pretty high magnitude. Does that say anything about the model? It seems to tell us that the errors are pretty big.

Instructor: Look at the scale of IBU: the MSE, remember, is on the square of that scale. If you took the square root of this, it would be in the tens. Typically, if people want to interpret the MSE, they'll take the square root of it and look at the root mean squared error, because that's on the same scale of units as the IBU. If you take the root of this, you'll be in the tens, which seems to be in line with the range of IBUs.

Student: Makes sense, thank you.
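In code, the interpretation step mentioned here is just the square root of the averaged MSEs from the sketch above:

```python
# Root mean squared error: back on the same scale of units as IBU
rmse = np.sqrt(mses.mean(axis=0))
print(rmse)
```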
Student: I've seen that it's standard practice to normalize all your variables to be between 0 and 1. Is that always advised?

Instructor: You can do that, and for regular, straight linear regression it doesn't make a huge difference; your coefficients will just be rescaled to account for the scale. For other models it matters: we're going to learn a model on Monday where you have to do it, otherwise you're going to mess up the model, and I think that's true of a lot of machine learning models other than linear regression, where you want to do it because otherwise it really messes up the model.

Okay, so we've got two more notebooks to get through, and based on my memory of what I cover in them, I think we should be able to.

Another participant: Actually, a quick question about this normalization. If you don't normalize, then that inverse matrix, the inverse of X transpose X that you discussed a while ago for computing the coefficients, I think the elements of that matrix could be pretty far apart; different elements will have very different magnitudes, right, if you don't normalize?

Instructor: That could happen. It didn't here, I think because of the scales of ABV and the indicator; it just didn't happen. It can happen. But also, scikit-learn in the background doesn't actually use the normal equations; it uses gradient descent, so it doesn't compute an inverse.

Another participant: But you did compute it the brute-force way as well, earlier. All I'm trying to say is that if the magnitudes are very different, it might make sense to do the normalization, at least for that type of computation.

Instructor: Can you elaborate on that one more time?

Another participant: You could have, and I forget the name of the term for such matrices, matrices that are badly behaved, like you're saying.

Instructor: That's right. But we're not computing that here. Even though I gave the background of how you classically solve a linear regression, it turns out scikit-learn doesn't do the normal equations; it does gradient descent, so it doesn't compute the inverse, and that's not as much of a problem. So I don't typically see people do scaling with linear regression, but with other models we will talk about scaling, and the last notebook we're going to go over today is about scaling, because next week's models use scaling quite a bit. It can be an issue, like you said; it just wasn't an issue today.
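Since scaling came up: the 0-to-1 normalization being discussed is usually done with a scaler fit on the training set only. A minimal sketch (this notebook has not introduced it yet, and MinMaxScaler is one of several options):

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training features only, then apply it to both splits
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(beer_train[['ABV', 'Stout']])
X_te_scaled = scaler.transform(beer_test[['ABV', 'Stout']])
```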
So the next thing we're going to build our regression toolkit up with is polynomial regression and nonlinear transformations.
00:52:55.000 --> 00:52:56.000 So for this, we're going to work with a synthetic data set.
00:52:56.000 --> 00:53:03.000 It's called poly.csv.
00:53:03.000 --> 00:53:04.000 I'm pretty sure it should be in the repository, but now I'm slightly nervous
00:53:04.000 --> 00:53:09.000 I forgot to upload it. So if this doesn't work for you, it's because I haven't uploaded the data.
00:53:09.000 --> 00:53:10.000 I will make sure that it's uploaded after, if it's not currently uploaded.
00:53:10.000 --> 00:53:24.000 So we have 2 inputs, x1 and x2, and then the thing we're trying to predict is y.
00:53:24.000 --> 00:53:29.000 So one thing that is a nice feature of pandas is it
00:53:29.000 --> 00:53:33.000 has this function called scatter_matrix. And I believe on Monday you'll also learn —
00:53:33.000 --> 00:53:34.000 or today, actually, you learned — a different function called pairplot in seaborn, which accomplishes the same thing.
00:53:34.000 --> 00:53:43.000 So with scatter_matrix
00:53:43.000 --> 00:53:46.000 you get this nice matrix of the different scatter plots.
00:53:46.000 --> 00:53:54.000 So each row corresponds to one variable on the vertical axis: in this row, every off-diagonal plot has x1
00:53:54.000 --> 00:54:12.000 as the vertical axis, and the diagonal plot shows the histogram of x1. And then every column has one variable as the horizontal axis: this column has x1 as the horizontal axis, this column has x2 as the horizontal axis, and then
00:54:12.000 --> 00:54:22.000 again the histogram is on the diagonal. So the one that we're most interested in is this row, because we want to see y as a function of x1 and x2.
00:54:22.000 --> 00:54:28.000 And so we're going to look at this, and we can see there does appear to be a linear relationship between y and x2,
00:54:28.000 --> 00:54:37.000 but there definitely seems to be some other type of relationship between y and x1.
00:54:37.000 --> 00:54:38.000 And so this is going to be where we might want to try and include some transformations of x1.
00:54:38.000 --> 00:54:50.000 So, for instance, it looks like it could potentially be sort of an even polynomial of x1.
00:54:50.000 --> 00:55:07.000 And so we're going to first start off with learning about polynomial regression, where you add in these sorts of terms, and then we'll branch into nonlinear transformations. Before we talk about that, though, I have a question: Zach is asking, is scatter_matrix preferred over
00:55:07.000 --> 00:55:16.000 a corner plot, just because you can see the transpose? I don't think that there's a preference; this is just the one I chose to use whenever I wrote this notebook,
00:55:16.000 --> 00:55:18.000 like 2 years ago.
00:55:18.000 --> 00:55:26.000 There might be die-hard people on, okay, you have to use scatter_matrix versus a corner plot, or vice versa —
00:55:26.000 --> 00:55:32.000 I'm not one of those people. This is just what I knew when I wrote the notebook.
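A minimal sketch of the scatter matrix call, assuming poly.csv has been read into a DataFrame with columns x1, x2, and y; the figsize is arbitrary.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("poly.csv")  # assumes columns x1, x2, y

# One scatter plot for every pair of columns, histograms on the diagonal.
# seaborn's pairplot(df) accomplishes much the same thing.
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.show()
```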
00:55:32.000 --> 00:55:35.000 Okay, so we want to look at the relationship between y and an even power of x1,
00:55:35.000 --> 00:55:46.000 and so we're gonna go ahead and make that one of the columns in our data frame.
00:55:46.000 --> 00:55:53.000 So we're gonna take df['x1'] and then we're gonna square it.
00:55:53.000 --> 00:56:00.000 And then we're going to look at the scatter matrix, go down to the row for y,
00:56:00.000 --> 00:56:04.000 and we can see here, you know, x1 still doesn't look linear.
00:56:04.000 --> 00:56:06.000 But if we look at the relationship between y and x1 squared, now this looks closer to being a linear relationship.
00:56:06.000 --> 00:56:16.000 Okay. So now we've got the idea of fitting the following model.
00:56:16.000 --> 00:56:23.000 So we said that there does appear to be a relationship between y and x2 —
00:56:23.000 --> 00:56:29.000 a linear relationship — and there also appears to be a linear relationship between y and x1 squared.
00:56:29.000 --> 00:56:31.000 And so for this model, we're going to regress y on x1, x1 squared, and x2.
00:56:31.000 --> 00:56:41.000 We'll talk about in a second why we also want to include x1.
00:56:41.000 --> 00:56:42.000 But for now, let's just take it that we want to include x1,
00:56:42.000 --> 00:56:48.000 and then I'll explain why we want to include it
00:56:48.000 --> 00:56:50.000 later in the notebook.
00:56:50.000 --> 00:56:51.000 Okay, so I'm just importing my linear regression
00:56:51.000 --> 00:57:01.000 and then fitting the model here. And we're also gonna look at a little trick, and we'll dive deeper into this trick in a later notebook next week.
00:57:01.000 --> 00:57:12.000 One thing that you might do when you're building these linear regression models is look at something called the residual plot.
00:57:12.000 --> 00:57:16.000 And so a residual plot is where you take your errors, also known as residuals, which is the actual minus
00:57:16.000 --> 00:57:20.000 the predicted, and then you plot that against your actual values.
00:57:20.000 --> 00:57:34.000 So, if our model is pretty good, then we would hope that our residuals are close to the random errors,
00:57:34.000 --> 00:57:53.000 remember, from the theoretical model. So what that means is, if our model is closely approximating y, we would expect our residuals to look like a nicely,
00:57:53.000 --> 00:58:01.000 evenly spread blob of points around the horizontal axis. So here's where I'm going to plot that residual plot, and I forgot to make this smaller,
00:58:01.000 --> 00:58:12.000 so let's do 8 comma 4. So you can see here
00:58:12.000 --> 00:58:18.000 that the residuals — so let's make this smaller too — oh no, okay.
00:58:18.000 --> 00:58:34.000 So you can see here that the residuals are definitely not an even band. What we would expect is, if our model is capturing all the information from the inputs, we would see sort of an even band, because in our assumption the residuals are
00:58:34.000 --> 00:58:40.000 normally distributed. So if our model is getting closer to approximating those random errors —
00:58:40.000 --> 00:58:43.000 you know, we're assuming those are normally distributed — whereas here,
00:58:43.000 --> 00:58:52.000 this is very clearly showing us a pattern in the data: as we change values of y, there's sort of this weird shape going on.
00:58:52.000 --> 00:58:54.000 So again, we'll dive more deeply into why this sort of thing happens in a later notebook.
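For reference, a sketch of the steps just described — the squared column, the regression of y on x1, x1², and x2, and the residual plot. The column and variable names here are my own, not necessarily the ones used in the notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("poly.csv")      # assumes columns x1, x2, y

# New column holding the even power of x1.
df["x1_sq"] = df["x1"] ** 2

features = ["x1", "x1_sq", "x2"]
reg = LinearRegression()
reg.fit(df[features], df["y"])

# Residuals = actual - predicted, plotted against the actual values.
residuals = df["y"] - reg.predict(df[features])

plt.figure(figsize=(8, 4))
plt.scatter(df["y"], residuals)
plt.axhline(0, color="black")
plt.xlabel("y")
plt.ylabel("residual")
plt.show()
```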
00:58:54.000 --> 00:59:00.000 But this is always an indicator that you're missing
00:59:00.000 --> 00:59:16.000 some signal from the variables that you have, that you might want to try and include. And a typical tip is: when you see a shape kind of like this — it's almost like a twisty, wishbone-y type shape — that usually is an indicator that you
00:59:16.000 --> 00:59:17.000 might want to try including an interaction term between the columns.
00:59:17.000 --> 00:59:28.000 So last notebook, we learned that an interaction term is like an indicator times a continuous variable.
00:59:28.000 --> 00:59:34.000 We also call any multiplication of 2 columns an interaction term.
00:59:34.000 --> 00:59:36.000 So when you see a residual plot like this — and again, we'll dive more into residual plots next week —
00:59:36.000 --> 00:59:43.000 it's an indicator that you're missing
00:59:43.000 --> 00:59:51.000 some sort of interaction term, or a nonlinear transformation, or something. So we're gonna go ahead and try and refit the model,
00:59:51.000 --> 00:59:58.000 but now including this last interaction term between x1 and x2. So we're gonna go ahead and do that:
00:59:58.000 --> 01:00:01.000 df['x1'] times df['x2'].
01:00:01.000 --> 01:00:12.000 And so, while I'm fitting the model, I'll also say: you might be confused because I didn't do a train test split here. Again,
01:00:12.000 --> 01:00:21.000 this one is just for demonstration purposes of, you know, the model and the transformations.
01:00:21.000 --> 01:00:33.000 This is not real data; it's synthetic. So if I wanted to, I could always go out into the world and produce new data. I'm not doing predictive modeling, trying to find the best possible model, or anything like that.
01:00:33.000 --> 01:00:34.000 It's just for instructive purposes. So that's why I did not make a train test split.
01:00:34.000 --> 01:00:42.000 Okay, so we've refit this model — we fit this model.
01:00:42.000 --> 01:00:43.000 Now we're gonna remake our residual plots,
01:00:43.000 --> 01:00:50.000 but I'm gonna make those edits that you saw me make earlier, just real quick.
01:00:50.000 --> 01:00:55.000 There we go. And so now you can see it's sort of this nice band between the values of negative 2 and 2, and this is what you're looking for with your residual plots.
01:00:55.000 --> 01:01:11.000 And again, this was sort of just a preview of something we'll dive into a little more deeply next week.
01:01:11.000 --> 01:01:15.000 So this tends to indicate that you've captured a decent amount of the signal.
01:01:15.000 --> 01:01:21.000 So, like I said, maybe this seems like a mystical process of, how are you supposed to know?
01:01:21.000 --> 01:01:24.000 We'll talk about this in much more detail next week.
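Continuing the sketch above, the interaction-term refit and the second residual plot might look roughly like this; again, the column names are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("poly.csv")
df["x1_sq"] = df["x1"] ** 2
df["x1x2"] = df["x1"] * df["x2"]   # the interaction term

features = ["x1", "x1_sq", "x2", "x1x2"]
reg_int = LinearRegression()
reg_int.fit(df[features], df["y"])

# With the interaction included, we hope the residuals form an even band around 0.
residuals = df["y"] - reg_int.predict(df[features])

plt.figure(figsize=(8, 4))
plt.scatter(df["y"], residuals)
plt.axhline(0, color="black")
plt.xlabel("y")
plt.ylabel("residual")
plt.show()
```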
01:01:24.000 --> 01:01:32.000 Okay. So I also promised I would tell you: if the relationship that seems to be linear is with x1 squared, why are we also including x1?
01:01:32.000 --> 01:01:42.000 So this is about something called respecting the hierarchy. If we look at the coefficients, this is the coefficient on x1 —
01:01:42.000 --> 01:01:50.000 this one I'm highlighting right here — and it's relatively close to 0.
01:01:50.000 --> 01:02:01.000 So you might be thinking, well, I should probably just remove it, because it's already pretty much 0 anyway, and just fit the model I've written here — maybe that's the functional form.
01:02:01.000 --> 01:02:17.000 You don't wanna do that. When you're building polynomial regression models, or regression models with interaction terms, you have to include all of the lower powers of the variable as well. So if you want a model that has
01:02:17.000 --> 01:02:22.000 x1 squared in it, you also have to include the variable x1.
01:02:22.000 --> 01:02:28.000 If you were to only include x1 squared, you'd be fitting a model of the form
01:02:28.000 --> 01:02:38.000 beta 0 plus beta 1 times x1 squared, and you're limiting the flexibility of your model to only consider parabolas of that form.
01:02:38.000 --> 01:02:41.000 By including x1, you're able to fit the whole range of parabolas over the real numbers.
01:02:41.000 --> 01:02:44.000 And so that's why you need to include all the lower powers.
01:02:44.000 --> 01:02:55.000 So, for instance, if you were including x1 cubed, you'd also need to include x1 squared and x1.
01:02:55.000 --> 01:02:56.000 If you had x1 to the fourth, you'd need to include x1 cubed, x1 squared, and x1 as well.
01:02:56.000 --> 01:03:08.000 So if you're fitting a polynomial of the nth degree, you need to include all the lower degree terms as well.
01:03:08.000 --> 01:03:21.000 Similarly for interaction terms: if you include x1 times x2, you need to include both x1 and x2 as their own predictors as well.
01:03:21.000 --> 01:03:30.000 And then, just as a final thing before I open it up for questions: the same way we made polynomial transformations —
01:03:30.000 --> 01:03:33.000 and if you did the problem session today, you saw that you can do this as well —
01:03:33.000 --> 01:03:39.000 you can just make nonlinear transformations of the columns. So you can take square roots,
01:03:39.000 --> 01:03:44.000 you can take logs, you can take sines, you can take e to the column.
01:03:44.000 --> 01:03:50.000 You can do any nonlinear transformation you'd like and then include it in whatever model you'd like.
01:03:50.000 --> 01:03:51.000 So that's another step in the process of, you know,
01:03:51.000 --> 01:03:58.000 looking at these different plots and seeing which ones maybe have linear relationships,
01:03:58.000 --> 01:04:02.000 because again, we're fitting a linear regression model.
01:04:02.000 --> 01:04:03.000 Okay. So now I'll open it up for questions before we end
01:04:03.000 --> 01:04:12.000 the polynomial regression notebook.
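One more note on the hierarchy point above — this wasn't shown in the lecture, but sklearn's PolynomialFeatures (mentioned again later in the session) will generate all the lower-degree and interaction terms for you, which is one way to respect the hierarchy automatically. A minimal sketch; get_feature_names_out assumes a recent version of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("poly.csv")   # assumes columns x1, x2, y

# degree=2 generates x1, x2, x1^2, x1*x2, x2^2 -- the full hierarchy up to degree 2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["x1", "x2"]])

print(poly.get_feature_names_out())   # ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
```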
01:04:12.000 --> 01:04:15.000 Hi! I have a question.
01:04:15.000 --> 01:04:16.000 Yeah.
01:04:16.000 --> 01:04:17.000 It seems like this can get really complicated really quick
01:04:17.000 --> 01:04:33.000 if you have lots of features. So in a real world situation, you might have like 10 features, and then you might see that some of them are maybe quadratic looking,
01:04:33.000 --> 01:04:36.000 some of them are exponential. So you have to include all those,
01:04:36.000 --> 01:04:46.000 and then you have to include all the interaction terms. Is there any advice for doing this with real data where you might have lots of interaction terms?
01:04:46.000 --> 01:04:53.000 Or would you just scrap this whole linear thing and maybe, you know, do something else?
01:04:53.000 --> 01:04:55.000 Yeah. So when you — no, no, it's fine, that's fine.
01:04:55.000 --> 01:04:58.000 Sorry, it's kind of a vague, general question, but.
01:04:58.000 --> 01:05:05.000 So, if you have a data set that has lots of features, it would be difficult to —
01:05:05.000 --> 01:05:17.000 it can be difficult to do this sort of systematic process of plotting. Like, for instance, with 10 it's maybe somewhat manageable to make a scatter matrix like this,
01:05:17.000 --> 01:05:20.000 but like you're saying, it could very quickly get out of hand, and in other problems
01:05:20.000 --> 01:05:28.000 you'll have more than 10. So in those situations you can't always do this sort of thing where you're plotting, or going step by step like this.
01:05:28.000 --> 01:05:37.000 In those cases you could still consider linear regression models,
01:05:37.000 --> 01:05:45.000 but you might have to try more algorithmic approaches to selecting features — which, like I said earlier, we are going to see more algorithmic approaches to selecting features
01:05:45.000 --> 01:05:55.000 next week. You may also consider other model types entirely. We don't explicitly cover the regression versions of these,
01:05:55.000 --> 01:06:01.000 but we do talk about random forests and support vector machines and k-nearest-neighbors
01:06:01.000 --> 01:06:06.000 type stuff when we do classification, and they have regression counterparts.
01:06:06.000 --> 01:06:10.000 So you might just try different models. You might try more algorithmic feature
01:06:10.000 --> 01:06:21.000 selection approaches. If it's reasonably manageable, you might just take the time to look at the different plots and then see, okay — like you said in your example, this seems to be quadratic —
01:06:21.000 --> 01:06:43.000 then you might include the quadratic term and then do a cross validation, because linear regressions tend to be relatively quick to fit in comparison to other models.
01:06:43.000 --> 01:06:48.000 And so Ky is asking: is principal component analysis another approach?
01:06:48.000 --> 01:06:54.000 So principal component analysis isn't necessarily showing you which features are the most important
01:06:54.000 --> 01:07:14.000 as predictors, or which give you the best predictions, but it can be a pre-processing step for linear regression models. In particular, linear regression models don't perform well if the columns are highly correlated
01:07:14.000 --> 01:07:15.000 with one another. This is because you can essentially rewrite one column as a linear combination of the others, or at least get close to it,
01:07:15.000 --> 01:07:26.000 so the regression fit performs badly in those cases. You might perform PCA
01:07:26.000 --> 01:07:27.000 first to get a set of perpendicular predictors.
01:07:27.000 --> 01:07:28.000 We'll talk about PCA
01:07:28.000 --> 01:07:32.000 more next week.
01:07:32.000 --> 01:07:37.000 It's not really something that's done in terms of figuring out which features are most important for predicting
01:07:37.000 --> 01:07:43.000 y.
01:07:43.000 --> 01:07:46.000 You know, I always thought PCA was just used
01:07:46.000 --> 01:07:52.000 if you have a lot of features — like more features than you have rows.
01:07:52.000 --> 01:07:53.000 Yeah, so —
01:07:53.000 --> 01:08:01.000 But I was more just saying, this model seems complicated even if you have, like, 3 features,
01:08:01.000 --> 01:08:02.000 because you have to consider interactions with them and everything. It can get complicated.
01:08:02.000 --> 01:08:12.000 Yeah, so it can get complicated. Yup. So PCA can also be used to sort of compress the data as well.
01:08:12.000 --> 01:08:17.000 Yes.
01:08:17.000 --> 01:08:18.000 Great.
01:08:18.000 --> 01:08:25.000 I just had a question about the first part of this exercise, where you square one of the variables.
01:08:25.000 --> 01:08:26.000 Yeah.
01:08:26.000 --> 01:08:30.000 What were you looking for when you squared it? I didn't quite follow.
01:08:30.000 --> 01:08:34.000 So remember, we're doing linear regression, and in order for linear regression to be a good model,
01:08:34.000 --> 01:08:48.000 there has to be a linear relationship. So if we look at this first plot, the relationship between y and x1 here is clearly not linear, but it does look sort of like an even polynomial, right?
01:08:48.000 --> 01:09:06.000 Even polynomials tend to go up in both directions and sort of curve like this. And so it might be reasonable to try something like an x1 squared to see if that helps. And so that's why we tried x1 squared.
01:09:06.000 --> 01:09:13.000 Okay. Thanks.
01:09:13.000 --> 01:09:30.000 Okay, so we're gonna leave the world of regression for the rest of today and go into the world of data pre-processing and talk about scaling data. And then, depending on how long this takes us, we may also start something called pipelines; we'll just have to see where we
01:09:30.000 --> 01:09:43.000 are at the end of this notebook. Okay, once my kernel starts.
01:09:43.000 --> 01:09:48.000 Awesome. Okay, so we're going to learn about things called scalers.
01:09:48.000 --> 01:09:49.000 In particular, we'll focus mainly on StandardScaler,
01:09:49.000 --> 01:10:05.000 but the process we learn in this notebook will apply for every single scaler object in sklearn. And then I also see I missed a question from Lara:
01:10:05.000 --> 01:10:08.000 can you do a different transformation for each variable? Yup.
01:10:08.000 --> 01:10:09.000 So it could be the case that you want to do something like make a square for x1
01:10:09.000 --> 01:10:20.000 and do a log transform for x2. Well, I guess that's not necessarily different from today's problem session,
01:10:20.000 --> 01:10:24.000 but in today's problem session you did log transforms of, I guess, one of the features
01:10:24.000 --> 01:10:28.000 and the thing you're predicting. But you can —
01:10:28.000 --> 01:10:31.000 it doesn't have to be the same transformation for every single variable.
01:10:31.000 --> 01:10:39.000 They can be different transformations, depending on what you're seeing in the data.
01:10:39.000 --> 01:10:45.000 Okay, so back to scalers. So we're gonna pretend that we have some data here.
01:10:45.000 --> 01:10:50.000 So it's just gonna be a series of different, randomly generated variables.
01:10:50.000 --> 01:10:56.000 And then, if you look at the way these variables are generated, you'll see that they have very different scales.
01:10:56.000 --> 01:11:00.000 And so what do I mean when we say the scales of the data?
01:11:00.000 --> 01:11:01.000 We just mean the powers of 10, basically.
01:11:01.000 --> 01:11:08.000 So, for instance, this one, x1, the first variable:
01:11:08.000 --> 01:11:18.000 if you look at the variance, it has a very high variance, and if you look at the way it's generated, it's in the thousands. The next one —
01:11:18.000 --> 01:11:21.000 this one — if you look at the variance and the mean,
01:11:21.000 --> 01:11:25.000 is in the ones and tens.
01:11:25.000 --> 01:11:29.000 Here you have something that's in the thousands or tens of thousands,
01:11:29.000 --> 01:11:32.000 and then here you have something that's on the scale of the hundreds.
01:11:32.000 --> 01:11:33.000 And so, like we mentioned earlier, some of your models
01:11:33.000 --> 01:11:40.000 will struggle if you have vastly different scales.
01:11:40.000 --> 01:11:46.000 So up to this point, our regression has been relatively well behaved with the models we fit,
01:11:46.000 --> 01:11:51.000 but in general, machine learning models and data science models can behave poorly
01:11:51.000 --> 01:11:53.000 if one of your columns has a vastly different scale from the other columns.
01:11:53.000 --> 01:11:56.000 So imagine a column that's in the tenths versus a column that's in the millions —
01:11:56.000 --> 01:12:09.000 that sort of thing. So typically, what you'll do as what's called a pre-processing or cleaning step is scale all of your data so that the different columns are operating on the same scale.
01:12:09.000 --> 01:12:31.000 One way to do this is standardization, which is slightly different from the normalization that we talked about earlier with some of our questions. In standardization, you do the following: you take your
01:12:31.000 --> 01:12:49.000 variable x — maybe this represents a column — and then you do all the observations minus the arithmetic mean of that column, divided by the standard deviation of that column. And so, if you've taken a frequentist
01:12:49.000 --> 01:12:51.000 statistics course, or used something called a Z table,
01:12:51.000 --> 01:12:58.000 this should look familiar. This is the exact transformation applied to turn any arbitrary normal random variable into what's known as a standard normal random variable, meaning
01:12:58.000 --> 01:12:59.000 it's a normal random variable with mean 0 and standard deviation
01:12:59.000 --> 01:13:07.000 1. So this is why the process is called standardizing: when you standardize a column, that column will then have mean 0 and standard deviation 1.
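The standardization formula written out in numpy, as a sanity check of what StandardScaler will do in a moment. The four made-up columns below only stand in for the notebook's randomly generated data; the scales are chosen to mimic the description above.

```python
import numpy as np

rng = np.random.default_rng(440)

# Four made-up columns on very different scales.
X = np.column_stack([
    rng.normal(0, 1000, 500),    # thousands
    rng.normal(5, 2, 500),       # ones and tens
    rng.normal(0, 20000, 500),   # tens of thousands
    rng.normal(300, 100, 500),   # hundreds
])

# Standardize each column: subtract its mean, divide by its standard deviation.
X_by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_by_hand.mean(axis=0))  # each entry is numerically close to 0
print(X_by_hand.std(axis=0))   # each entry is close to 1
```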
01:13:07.000 --> 01:13:25.000 And so this way you could take all 4 of these columns and put them on the same scale.
01:13:25.000 --> 01:13:29.000 So how do you do this in Python? You use something called the StandardScaler object.
01:13:29.000 --> 01:13:30.000 This is stored in the preprocessing sub-package of
01:13:30.000 --> 01:13:39.000 sklearn.
01:13:39.000 --> 01:13:48.000 We're gonna see a lot of tools from preprocessing over the next couple weeks.
01:13:48.000 --> 01:13:55.000 One of them is StandardScaler. Okay, so this is known as a scaler object.
01:13:55.000 --> 01:14:02.000 There are multiple scaler objects. So if we go down here — see, where did it go?
01:14:02.000 --> 01:14:06.000 I think we still want to be in preprocessing.
01:14:06.000 --> 01:14:09.000 Here we go. So we've got MaxAbsScaler, MinMaxScaler,
01:14:09.000 --> 01:14:21.000 Normalizer, StandardScaler, RobustScaler. We're going to focus on and mostly just use StandardScaler,
01:14:21.000 --> 01:14:27.000 but you might be interested in seeing what these different scalers do by checking out the documentation.
01:14:27.000 --> 01:14:31.000 So let me show you how to use StandardScaler in Python.
01:14:31.000 --> 01:14:44.000 The first thing we need to do is import it: from sklearn.preprocessing we're going to import StandardScaler. And then I also wanna make a quick note.
01:14:44.000 --> 01:14:48.000 I noticed in some of the problem session work groups earlier today,
01:14:48.000 --> 01:14:51.000 maybe some of us are newer to Jupyter notebooks than others.
01:14:51.000 --> 01:14:58.000 So one thing that's really easy: instead of having to go up here and click Run — or some people have the Run button over here —
01:14:58.000 --> 01:15:02.000 if you just hit Shift and Enter at the same time, it runs the code.
01:15:02.000 --> 01:15:07.000 Another nice feature is you can try auto-complete.
01:15:07.000 --> 01:15:19.000 So you might have noticed when I started typing "from sklearn.pre", this showed up.
01:15:19.000 --> 01:15:22.000 This showed up because I hit the Tab button, and by hitting the Tab button
01:15:22.000 --> 01:15:27.000 it shows you, using what it knows about the package, here are the things that you might be trying to type out.
01:15:27.000 --> 01:15:30.000 So if you click on it, it will then auto-complete for you.
01:15:30.000 --> 01:15:32.000 If there's only one option, it just does the auto-complete right away.
01:15:32.000 --> 01:15:35.000 So this can be a nice feature that cuts down on your typing.
01:15:35.000 --> 01:15:41.000 Okay, so that's the Jupyter notebook aside; back to StandardScaler.
01:15:41.000 --> 01:15:45.000 So the first thing we have to do is make a StandardScaler object.
01:15:45.000 --> 01:15:54.000 I'm going to call that scaler equals, and then you just say StandardScaler().
01:15:54.000 --> 01:15:57.000 Then we have to do what's called fitting the scaler. So we do
01:15:57.000 --> 01:16:00.000 scaler.fit(X). And so remember, why are we doing X?
01:16:00.000 --> 01:16:09.000 Because that's what my data is stored in. So we can imagine that this is X_train or X test — well, sorry,
01:16:09.000 --> 01:16:18.000 not X test, just X_train. We can imagine that this is like an X_train for a model, but for simplicity of typing, I just did X. Okay.
01:16:18.000 --> 01:16:22.000 And so you might be wondering, what do you mean that you have to fit?
01:16:22.000 --> 01:16:29.000 So what's happening when you fit the scaler is, it's going through each of the columns of your array,
01:16:29.000 --> 01:16:39.000 finding the mean, and also finding the standard deviation, because it needs to know both of those things in order to perform the standardization.
01:16:39.000 --> 01:16:43.000 So that's what's happening when we call dot fit.
01:16:43.000 --> 01:16:50.000 Okay. The next thing we need to do is then scale the data, and in sklearn's syntax
01:16:50.000 --> 01:16:55.000 this is called transform. So we'll do scaler.transform,
01:16:55.000 --> 01:17:01.000 and we input X, and this will be stored in a new variable that we call X_scaled.
01:17:01.000 --> 01:17:10.000 Okay. And so now, if we look at the mean and the variance of the scaled data, we can see that all of them have means that are virtually the same as 0 —
01:17:10.000 --> 01:17:13.000 it's difficult for computers to get exactly to 0,
01:17:13.000 --> 01:17:19.000 but this is close — and then all of them have variances that are virtually the same as 1.
01:17:19.000 --> 01:17:25.000 Okay.
01:17:25.000 --> 01:17:26.000 Okay, so —
01:17:26.000 --> 01:17:30.000 Is this okay to do if you have one-hot encoded variables?
01:17:30.000 --> 01:17:41.000 Yeah. So this is typically only done for continuous variables.
01:17:41.000 --> 01:17:42.000 Okay.
01:17:42.000 --> 01:17:45.000 If you have one-hot encoded variables, you would have to separate those off from your array, or, when we learn pipelines, you'd have to make a custom
01:17:45.000 --> 01:17:49.000 pipeline object in order to do that.
01:17:49.000 --> 01:17:52.000 Sorry, I have a question.
01:17:52.000 --> 01:17:53.000 Yeah.
01:17:53.000 --> 01:17:59.000 What will happen in the situation where we have the data set already, and the columns are on the same scale already?
01:17:59.000 --> 01:18:03.000 Would we still need to apply this?
01:18:03.000 --> 01:18:18.000 Yeah. So if all of your columns are on about the same scale, you theoretically wouldn't have to do the scaling. I think sometimes it would still be good practice to do the scaling, so like —
01:18:18.000 --> 01:18:32.000 I'm trying to think of a good example off the top of my head, but I think it's just typically good practice to use the standard scaler for the models that struggle without it. But if all of your columns are on about the same scale,
01:18:32.000 --> 01:18:37.000 theoretically everything should be okay. But computers can struggle with really large numbers or with really small numbers,
01:18:37.000 --> 01:18:47.000 so sometimes it can still be useful to standard scale.
01:18:47.000 --> 01:18:48.000 Yeah.
01:18:48.000 --> 01:18:51.000 Okay.
01:18:51.000 --> 01:18:58.000 Okay. So I wanna make sure this whole fit, transform, and fit_transform thing makes sense.
01:18:58.000 --> 01:19:15.000 So fit was the fitting process — that meant, in this particular instance, calculating the mean and the standard deviation of every column. Transform was then the thing that went through and actually calculated, for each column, x
01:19:15.000 --> 01:19:22.000 minus the mean, divided by the standard deviation.
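A sketch of the fit-then-transform steps just described. Here X only stands in for the notebook's array of randomly generated columns, so the two columns below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data on two very different scales.
rng = np.random.default_rng(440)
X = np.column_stack([rng.normal(0, 1000, 500), rng.normal(5, 2, 500)])

scaler = StandardScaler()
scaler.fit(X)                    # learns each column's mean and standard deviation
X_scaled = scaler.transform(X)   # applies (x - mean) / std, column by column

print(X_scaled.mean(axis=0))  # virtually 0
print(X_scaled.var(axis=0))   # virtually 1

# fit_transform does both steps at once; calling transform on a scaler
# that was never fit raises a NotFittedError.
X_scaled_again = StandardScaler().fit_transform(X)
```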
So you always have to call fit before you can call transform.
01:19:22.000 --> 01:19:28.000 So, for instance, if I made
01:19:28.000 --> 01:19:37.000 scaler_2 — also a StandardScaler object — and then tried to call scaler_2.transform(X),
01:19:37.000 --> 01:19:40.000 we'll get an error, and we'll see right here at the top:
01:19:40.000 --> 01:19:50.000 NotFittedError. So fitting always has to be done
01:19:50.000 --> 01:19:56.000 before transforming. Okay, so let's minimize that.
01:19:56.000 --> 01:20:03.000 And there's also this nice function that you can use for these objects called fit_transform
01:20:03.000 --> 01:20:15.000 that will do all of this at the same time. So if we tried this again, and we did scaler_2.fit_transform(X),
01:20:15.000 --> 01:20:18.000 we'll see that we get out what we want. And then later we would not have to refit it, because it's already been fit once. So fit_transform
01:20:18.000 --> 01:20:30.000 does both fit and transform at the same time.
01:20:30.000 --> 01:20:36.000 Okay. So you might be wondering, well, if it does both at the same time, why do I need anything other than fit_transform? That should always be what I use.
01:20:36.000 --> 01:20:42.000 So, whenever you're doing predictive modeling work, you need to fit your scalers only on the training data,
01:20:42.000 --> 01:20:52.000 never on the test data. Remember, the idea with predictive modeling is we don't —
01:20:52.000 --> 01:21:10.000 we don't know what the labels on our test set or holdout sets are, so we have to fit the scaler on whatever the training data is, and then use that fitted scaler for the test or holdout data. So if we were to go back and try and refit the scaler using
01:21:10.000 --> 01:21:14.000 the test or the holdout data, that would be what's known as data leakage.
01:21:14.000 --> 01:21:25.000 Okay, so let's go ahead and show you what this looks like in sort of a predictive modeling workflow.
01:21:25.000 --> 01:21:27.000 So I'm gonna import my train test split and then make a train test split of X. I'll also take this chance to point something out,
01:21:27.000 --> 01:21:39.000 because I think I confused people earlier. So when I introduced the train test split, I used an array here;
01:21:39.000 --> 01:21:45.000 this is also an array. But you don't always have to have both an X and a y for train test
01:21:45.000 --> 01:21:48.000 split. So if you're only splitting one thing, it's okay to put just that one thing in. Train test split is also built
01:21:48.000 --> 01:21:50.000 to be able to take 2 things, like an X and a y, so you could do that.
01:21:50.000 --> 01:22:08.000 So, for instance, if I had a y, I could put it here, but I don't have to. And then, when I had something like the data frame from earlier in the problem session, you can just put the data frame in here
01:22:08.000 --> 01:22:12.000 and split that into — I think it was cars_train and cars_test. Okay?
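A small sketch of that aside — train_test_split is happy with a single object, whether it's an array or a DataFrame. The array, column names, and random_state here are hypothetical stand-ins, not the notebook's values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(440)
X = rng.normal(0, 1, size=(100, 3))   # placeholder array

# Splitting a single object is fine; you just get two pieces back.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=440)

# A DataFrame works the same way (e.g. the problem session's cars data):
cars = pd.DataFrame(X, columns=["a", "b", "c"])   # hypothetical stand-in
cars_train, cars_test = train_test_split(cars, test_size=0.2, random_state=440)

# And with labels you can pass both an X and a y:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=440)
```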
01:22:12.000 --> 01:22:17.000 So now that that aside's over, let's go through our workflow.
01:22:17.000 --> 01:22:25.000 So we would first make our StandardScaler object.
01:22:25.000 --> 01:22:27.000 We would fit it on the training set and then get the transformed, scaled data on the training set.
01:22:27.000 --> 01:22:38.000 We could have used fit_transform here, because it's just the training set.
01:22:38.000 --> 01:22:51.000 Then we would go ahead and imagine that we build some model, like a linear regression or something. And then, once we're done, in order to get the model's predictions on the test set,
01:22:51.000 --> 01:22:58.000 we would do scaler_new.transform(X_test).
01:22:58.000 --> 01:23:01.000 So we would not refit the scaler on the test set;
01:23:01.000 --> 01:23:05.000 we use the fitted scaler from the training data. Okay?
01:23:05.000 --> 01:23:10.000 And like I said earlier, there are other scaler objects besides StandardScaler.
01:23:10.000 --> 01:23:24.000 You can find them in the documentation, and I can see in the chat that some of you have already started exploring the documentation, which is great. And I believe, Pedro, we are going to look at polynomial features in a later notebook,
01:23:24.000 --> 01:23:32.000 so that's a great find. Okay, are there any questions about scaling the data?
01:23:32.000 --> 01:23:41.000 Yeah. So when we transform the test set, that would be with the mean and standard deviation of the test set —
01:23:41.000 --> 01:23:42.000 sorry, the training set. Am I right?
01:23:42.000 --> 01:23:47.000 Yup, yeah. Yup, yup.
01:23:47.000 --> 01:24:00.000 So, if I understood this correctly, we can essentially do the fit_transform on the entire data set before we split it.
01:24:00.000 --> 01:24:01.000 Is that correct?
01:24:01.000 --> 01:24:04.000 So, that's a great question — because you cannot do that.
01:24:04.000 --> 01:24:28.000 You can't do that, because the data from your test set would then be leaking into your model fitting process: it would be encoded into the mean and the standard deviation that were fit with the scaler.
01:24:28.000 --> 01:24:35.000 So the standard scaler should be considered a part of your modeling. The mean and the standard deviation you get for the scaler can only come from your training set, not from your test set.
01:24:35.000 --> 01:24:48.000 So if you did this before the train test split, some of the data from the test set would be leaking into your model, because it would be encoded in the mean and the standard deviation of the scaler.
01:24:48.000 --> 01:24:50.000 Cool. Thank you.
01:24:50.000 --> 01:24:53.000 Yeah, yeah, thanks for asking that question.
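Putting the pieces together, a sketch of the workflow just described: fit the scaler on the training data only, then use that same fitted scaler to transform both the training and test sets. The synthetic X and y and the random_state are placeholders for whatever data you actually have.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-ins for your features and target.
rng = np.random.default_rng(440)
X = rng.normal(0, 100, size=(500, 3))
y = X @ np.array([0.5, -2.0, 1.0]) + rng.normal(0, 5, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=440)

# Fit the scaler on the training data only (fit_transform = fit, then transform).
scaler_new = StandardScaler()
X_train_scaled = scaler_new.fit_transform(X_train)

# Build whatever model you like on the scaled training data.
reg = LinearRegression()
reg.fit(X_train_scaled, y_train)

# Test set: transform only -- never refit the scaler on the test data.
X_test_scaled = scaler_new.transform(X_test)
test_predictions = reg.predict(X_test_scaled)
```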
01:24:53.000 --> 01:24:58.000 I'm just trying to recap what you explained.
01:24:58.000 --> 01:25:02.000 So you're saying that you would do the fit_transform in one
01:25:02.000 --> 01:25:05.000 go on the training set, correct?
01:25:05.000 --> 01:25:08.000 Yeah, so you can do that. Yup.
01:25:08.000 --> 01:25:13.000 But on the test set you would just do fit and not transform it?
01:25:13.000 --> 01:25:18.000 You would just do transform and not fit.
01:25:18.000 --> 01:25:21.000 Okay, so yes — you would just do transform and not fit.
01:25:21.000 --> 01:25:25.000 And then you would build the model on the training —
01:25:25.000 --> 01:25:28.000 and then once you have it, yeah, okay — and then test it on the test set.
01:25:28.000 --> 01:25:38.000 Okay, right. So basically you're saying you never do fit on the test set, right?
01:25:38.000 --> 01:25:42.000 You shouldn't do fit on the test set.
01:25:42.000 --> 01:25:43.000 Okay.
01:25:43.000 --> 01:25:54.000 Yes, yup — with the caveat that once you're done and you've selected a model, you would refit the entire model on the entire data set that you have. But that's at the very end,
01:25:54.000 --> 01:26:02.000 like, I have this model, and I'm happy with it, and I'm gonna put it on whatever system my team is using.
01:26:02.000 --> 01:26:03.000 What if you're doing a validation set, like doing some cross
01:26:03.000 --> 01:26:19.000 validation? Would you just do the fit and then transform on the whole train set, and then split again into train and validation?
01:26:19.000 --> 01:26:24.000 Yeah. So the validation set, or the holdout set from cross validation —
01:26:24.000 --> 01:26:28.000 that is sort of mimicking the role of a test set.
01:26:28.000 --> 01:26:29.000 So each time through the cross validation, you would have to fit the scaler
01:26:29.000 --> 01:26:41.000 excluding that validation split. So, like —
01:26:41.000 --> 01:26:42.000 Yeah.
01:26:42.000 --> 01:26:43.000 Okay. So you would make sure to not fit on the validation set.
01:26:43.000 --> 01:26:49.000 Okay. I was thinking you would use the standard scaler before you do the cross validation. But —
01:26:49.000 --> 01:26:53.000 Yup. And then next week — because we're out of time today —
01:26:53.000 --> 01:27:02.000 next week we'll learn about pipelines, and those are a nice way to basically include the standard scaler as a part of your model,
01:27:02.000 --> 01:27:13.000 so it'll be easier to remember: don't fit your standard scaler on these holdouts.
01:27:13.000 --> 01:27:18.000 Great. Any other questions about anything?
01:27:18.000 --> 01:27:19.000 Yeah.
01:27:19.000 --> 01:27:22.000 Yes, I have a question. So eventually we would transform the test set also.
01:27:22.000 --> 01:27:31.000 So I was just wondering: eventually you're going to make your prediction, and in the real world you want to see the actual values.
01:27:31.000 --> 01:27:33.000 Now that I have transformed the test set also, does it affect how the betas are going to be seen for linear regression,
01:27:33.000 --> 01:27:45.000 for example? If you have transformed your data into numbers between 0 and 1, but in the real world what you want to see is maybe, like, 100 and so on —
01:27:45.000 --> 01:27:48.000 so I think you'd go back?
01:27:48.000 --> 01:27:56.000 Yeah. So, for the coefficients — a lot of times, right, in linear regression
01:27:56.000 --> 01:28:06.000 you want to interpret the coefficients to get a sense of: if I increase expenditures by 2, then I expect whatever increase in profits — that sort of thing.
01:28:06.000 --> 01:28:13.000 So when you do this sort of scaling process, you lose the ability to directly interpret the coefficients
01:28:13.000 --> 01:28:28.000 in the scale of the original data. You would have to do sort of a backwards process on the coefficients to re-engineer what they mean on the original scale.
01:28:28.000 --> 01:28:30.000 I see, that makes sense.
01:28:30.000 --> 01:28:32.000 Yeah.
01:28:32.000 --> 01:28:33.000 Thank you.
01:28:33.000 --> 01:28:42.000 Of course. Alright! If there are no other questions, I'm gonna stop the recording, and then I'll hang back for a few minutes to answer questions that
01:28:42.000 --> 01:28:44.000 people maybe didn't want recorded.
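Circling back to the cross-validation question above, a minimal sketch of fitting the scaler inside each fold so the validation split never leaks into the scaling; next week's pipelines make this bookkeeping automatic. The synthetic arrays and random_state are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(440)
X_train = rng.normal(0, 100, size=(400, 3))   # placeholder training features
y_train = X_train @ np.array([0.5, -2.0, 1.0]) + rng.normal(0, 5, 400)

kfold = KFold(n_splits=5, shuffle=True, random_state=440)
mses = []

for train_index, val_index in kfold.split(X_train):
    # Fit the scaler on this fold's training portion only...
    scaler = StandardScaler()
    X_tt = scaler.fit_transform(X_train[train_index])
    X_val = scaler.transform(X_train[val_index])   # ...and only transform the validation split.

    reg = LinearRegression()
    reg.fit(X_tt, y_train[train_index])
    mses.append(mean_squared_error(y_train[val_index], reg.predict(X_val)))

print(np.mean(mses))
```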
01:28:44.000 --> 01:28:50.000 Okay.